Data Governance, Data Quality, and Master Data

πŸ“ Where we are: The last chapter proved that a single computer system is fit for purpose; this one builds the organizational backbone that keeps all of them trustworthy β€” the policies, roles, and definitions that make the integrity and validation controls earlier in this part actually stick.

In the previous chapter we learned how to trust one system. Computerized System Validation (CSV) proves a piece of software is fit for its job. The GAMP 5 framework, the pharmaceutical industry's risk-based playbook organized around software categories and the V-model that pairs each requirement with a matching test, tells you how much proof is enough. An emerging approach, Computer Software Assurance (CSA), formalized in FDA guidance issued in 2025, shifts the emphasis from stacks of paperwork toward critical thinking about real risk. But validating each system one at a time leaves a gap. A plant may run a dozen individually validated systems, among them a manufacturing execution system (MES) such as Siemens Opcenter or Körber PAS-X, a laboratory system, and an enterprise resource planning (ERP) system, and still drown in contradictory, untraceable, unconnectable data, because nobody agreed on the rules that govern data across them. Closing that gap is the job of data governance.

The simple version

Think of a large library. Buying good shelves and a working checkout scanner (the technical systems) is not enough. Someone has to decide who is allowed to add books, how every book is labeled and shelved, what counts as "the same" book when three editions arrive, and who fixes a mis-catalogued title. Without those rules and those people, the building fills with books nobody can find. Data governance is the librarian's rulebook for an entire factory's data.

What this chapter covers

We start by defining data governance and its three custodial roles, then walk the dimensions of data quality that tell us whether data is any good, then the metadata that gives raw numbers meaning, and finally master data management, the shared definitions of materials, equipment, and products that prevent identifier chaos. The thread tying it all together: you cannot connect data you have not first governed.

Data governance: rules and the people who own them

Data governance is the exercise of authority and control over the management of data: the system of decision rights and accountabilities that says who can do what, with which data, under what rules [1]. It is not software; it is the organizational layer above software. Research on the subject frames governance as a set of decision domains (principles, quality, metadata, access, and lifecycle), each paired with a clear locus of accountability: a named answer to "who decides?" [3].

In regulated biomanufacturing, governance is not optional housekeeping. The World Health Organization is explicit that senior management is responsible for an effective data governance system, embedded in the company's quality system, that applies the integrity principles across the whole data lifecycle [8]. The pharmaceutical engineering guidance from ISPE (the GAMP records and data integrity framework) similarly treats a data governance framework as the umbrella under which data integrity controls live [5]. Governance, in other words, is the connective tissue between the abstract demand "data must have integrity" and the concrete controls (audit trails, access limits, validation) that deliver it.

Three roles share the work, and confusing them is a classic failure mode [1]:

  • The data owner is accountable: the businessperson who answers for a data domain (say, "all batch records") and sets the rules for it.
  • The data steward is responsible: the subject-matter expert who tends the data day to day, defines what each field means, and resolves quality problems.
  • The data custodian is the technical caretaker, typically IT, who runs the storage, backups, and access controls, but does not decide the business meaning of the data.

Governance flows from a management mandate, through policy, to three distinct roles (accountability, stewardship, and technical custody) that together produce trustworthy data. Figure by the authors.

note

A useful shorthand: the owner is accountable (their name is on it), the steward is responsible (they do the work), and the custodian keeps the keys (they run the infrastructure). One person can wear two hats in a small lab, but the three duties must each have a home.
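The rule that each of the three duties must have a home can be checked mechanically. The following is a minimal sketch, not a prescribed tool: the `DomainRoles` class, the `unassigned_duties` function, and the example domains and job titles are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DomainRoles:
    """Role assignments for one data domain (all names are illustrative)."""
    owner: Optional[str] = None      # accountable: sets the rules
    steward: Optional[str] = None    # responsible: tends the data day to day
    custodian: Optional[str] = None  # technical caretaker: runs the infrastructure


def unassigned_duties(registry: dict) -> dict:
    """Return, per data domain, the duties that nobody has been named for."""
    gaps = {}
    for domain, roles in registry.items():
        missing = [duty for duty in ("owner", "steward", "custodian")
                   if not getattr(roles, duty)]
        if missing:
            gaps[domain] = missing
    return gaps


registry = {
    "batch records": DomainRoles(owner="Head of Manufacturing",
                                 steward="MES process engineer",
                                 custodian="IT infrastructure"),
    # One person wearing two hats is allowed; an empty duty is not.
    "lab results": DomainRoles(owner="QC lab manager",
                               steward="QC lab manager"),
}

print(unassigned_duties(registry))  # {'lab results': ['custodian']}
```

The check deliberately ignores who holds a role and looks only at whether the duty is filled, which mirrors the shorthand above: two hats on one head is acceptable, a missing hat is a governance gap.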

Figure: A master-data governance lifecycle (define, capture, steward, reconcile, publish) turns scattered system-local records into one trusted master source. Original diagram by the authors, created with AI assistance.

The dimensions of data quality

"Good data" is too vague to manage. The discipline breaks quality into measurable dimensions: distinct properties you can check one at a time. The landmark study of what data quality means to the people who actually use data grouped these into four families: intrinsic (is the data correct in itself?), contextual (is it right for the task at hand?), representational (is it presented clearly?), and accessibility (can you get at it?) [2]. The international data-quality standard, ISO 8000, similarly frames quality as the degree to which data meets stated requirements [9]. In bioprocessing the dimensions that bite hardest are:

  • Accuracy: does the value match reality? A pH reading of 7.2 must reflect the actual broth, not a drifted, uncalibrated probe.
  • Completeness: is anything missing? A batch record with a blank in-process result is not merely untidy; under cGMP (current Good Manufacturing Practice) it is a data-integrity defect [6]. U.S. regulation makes this concrete: 21 CFR 211.188 requires batch production and control records to document all in-process testing and results, so a blank field where a result belongs is a compliance violation, not a stylistic one.
  • Consistency: does the same fact agree across systems? If the process historian (such as the AVEVA PI System, formerly OSIsoft PI) says the batch finished at 14:03 and the manufacturing system says 15:03, at least one is wrong.
  • Timeliness: is the data available when it is needed, and recorded when the event happened? Integrity guidance calls this being contemporaneous, meaning recorded at the time of the activity, not reconstructed later [6].
  • Uniqueness: is each real-world thing represented exactly once? Two records for the same material lot are a recipe for mixing the wrong components.
  • Validity: does the value obey its rules? A temperature of "−500 °C" or a date of "2026-13-40" is invalid on its face.

These dimensions are the quality face of data integrity, which regulators define as data that is complete, consistent, and accurate across its lifecycle [6]. The ALCOA+ attributes of Chapter 9 and these quality dimensions overlap by design: ALCOA+ frames them as integrity requirements a record must satisfy, while the dimensions frame them as measurable properties you can score. Complete the attribute and Completeness the dimension are the same idea seen from two angles, not two different things. The GxP integrity guidance (GxP being the umbrella term for the "Good x Practice" regulations: manufacturing, laboratory, clinical, and the rest) frames the same idea through the ALCOA principles and stresses data criticality and risk assessment, that is, spending the most quality effort where a wrong number would most endanger the patient or the product [7].
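Because each dimension is a separate, checkable property, several of them reduce to small independent tests. The sketch below illustrates validity, completeness, uniqueness, and consistency under stated assumptions: the function names, the field names, and the 60-second tolerance are hypothetical choices, not taken from any guidance. Accuracy is left out on purpose, since it needs a comparison against physical reality (for example a calibration standard) rather than against the record itself.

```python
from datetime import datetime


def is_valid_temperature(value_c: float) -> bool:
    """Validity: nothing can be colder than absolute zero (-273.15 °C)."""
    return value_c >= -273.15


def is_valid_date(text: str) -> bool:
    """Validity: the string must parse as a real calendar date (YYYY-MM-DD)."""
    try:
        datetime.strptime(text, "%Y-%m-%d")
        return True
    except ValueError:
        return False


def missing_fields(record: dict, required: list) -> list:
    """Completeness: list the required fields that are absent or blank."""
    return [name for name in required if record.get(name) in (None, "")]


def duplicate_ids(records: list, key: str) -> set:
    """Uniqueness: return key values that appear on more than one record."""
    seen, dupes = set(), set()
    for record in records:
        value = record[key]
        if value in seen:
            dupes.add(value)
        seen.add(value)
    return dupes


def end_times_agree(a: str, b: str, tolerance_s: int = 60) -> bool:
    """Consistency: two systems' ISO 8601 end times agree within a tolerance."""
    delta = datetime.fromisoformat(a) - datetime.fromisoformat(b)
    return abs(delta.total_seconds()) <= tolerance_s


print(is_valid_temperature(-500))                          # False
print(is_valid_date("2026-13-40"))                         # False
print(missing_fields({"lot": "LOT-7", "in_process_ph": ""},
                     ["lot", "in_process_ph"]))            # ['in_process_ph']
print(duplicate_ids([{"lot": "LOT-7"}, {"lot": "LOT-8"},
                     {"lot": "LOT-7"}], "lot"))            # {'LOT-7'}
print(end_times_agree("2026-06-14T14:03:00",
                      "2026-06-14T15:03:00"))              # False
```

Keeping one function per dimension means each result can be scored and reported separately, which is what makes the dimensions manageable rather than a single vague "quality" flag.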

caution

Quality dimensions can conflict. Demanding more completeness (capture everything) can hurt timeliness (it takes longer), and chasing perfect accuracy on every trivial field wastes effort better spent on the critical ones. Good governance ranks data by criticality and applies the dimensions proportionately [7], a risk-based stance that directly echoes the CSA mindset from the previous chapter. For example, a drifting temperature sensor in a mammalian cell-culture bioreactor directly affects product titer and patient safety (high criticality, warranting 100% review), while humidity logging in a storage room (low criticality) can rely on sampling-based checks.
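One way to make proportionate review concrete is a policy table that maps criticality to the share of records a reviewer must check. This is only a sketch: the `REVIEW_FRACTION` table, the `select_for_review` function, and the record identifiers are hypothetical, and only the "high means 100% review" entry comes from the example above. The 25% and 5% figures are placeholders a real site would justify through its own risk assessment.

```python
import random

# Hypothetical policy table: fraction of records a reviewer must check.
# Only "high = 100%" comes from the text; the other numbers are placeholders.
REVIEW_FRACTION = {"high": 1.0, "medium": 0.25, "low": 0.05}


def select_for_review(record_ids, criticality, seed=0):
    """Return the record IDs to review for a data stream of given criticality."""
    ids = list(record_ids)
    fraction = REVIEW_FRACTION[criticality]
    if not ids or fraction >= 1.0:
        return ids  # nothing to sample, or everything is reviewed
    # Review at least one record, chosen by a seeded (reproducible) random draw.
    sample_size = max(1, round(len(ids) * fraction))
    return sorted(random.Random(seed).sample(ids, sample_size))


bioreactor_temps = [f"BR-101-T-{i:04d}" for i in range(200)]
storage_humidity = [f"WH-3-RH-{i:04d}" for i in range(200)]

print(len(select_for_review(bioreactor_temps, "high")))  # 200
print(len(select_for_review(storage_humidity, "low")))   # 10
```

The fixed seed makes the sample reproducible, so a later reviewer can reconstruct which records were selected and why.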

Metadata: the context that makes data mean something

Back in Chapter 1, The Biologic and Its Data Shadow, we saw that a bare number such as 37 is meaningless without its context. Metadata is "data about data": the surrounding context that tells you the number is a temperature, in degrees Celsius, from bioreactor BR-101, at 14:03:22, recorded by operator J. Lee. In a modern system that same reading is stored not as a lone 37 but as a structured object carrying its own context:

{
  "value": 37,
  "unit": "°C",
  "equipment_id": "BR-101",
  "timestamp": "2026-06-14T14:03:22Z",
  "operator": "J. Lee",
  "sensor_id": "TEMP-001"
}

Without that surrounding context, the number cannot be interpreted, audited, or trusted. The integrity guidances are blunt that metadata is part of the record: data without its metadata is not complete data [6][7].
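The idea that a value and its metadata form one record can be enforced in code by making the context non-optional. The sketch below assumes the six fields of the reading shown above; the `Reading` class and `parse_reading` function are hypothetical names, and a production system would add type and range checks.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Reading:
    """A measurement that cannot exist without its context (illustrative)."""
    value: float
    unit: str
    equipment_id: str
    timestamp: datetime
    operator: str
    sensor_id: str

    def describe(self) -> str:
        """Render the reading as a sentence a human auditor can follow."""
        return (f"{self.value} {self.unit} on {self.equipment_id} "
                f"(sensor {self.sensor_id}) at {self.timestamp.isoformat()}, "
                f"recorded by {self.operator}")


def parse_reading(obj: dict) -> Reading:
    """Build a Reading; raises KeyError or TypeError if context is missing."""
    fields = dict(obj)
    # Convert the trailing "Z" (UTC) to an explicit offset before parsing.
    iso_text = fields["timestamp"].replace("Z", "+00:00")
    fields["timestamp"] = datetime.fromisoformat(iso_text)
    return Reading(**fields)


reading = parse_reading({
    "value": 37,
    "unit": "°C",
    "equipment_id": "BR-101",
    "timestamp": "2026-06-14T14:03:22Z",
    "operator": "J. Lee",
    "sensor_id": "TEMP-001",
})
print(reading.describe())

try:
    parse_reading({"value": 37})  # a bare number with no context
except (KeyError, TypeError) as err:
    print("rejected:", type(err).__name__)
```

Rejecting the bare `37` at the point of capture is the design choice here: a record that cannot be built without its metadata can never be stored as incomplete data.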

Governing metadata is therefore one of the named governance domains [3], and managing it well is what makes data findable and reusable rather than a write-only graveyard. The widely adopted FAIR principles (that data should be Findable, Accessible, Interoperable, and Reusable) put rich, machine-actionable metadata at their very center: data is only findable and reusable if it carries metadata a computer, not just a human, can read [4]. This is the hinge of the whole chapter. Metadata governed today is what lets a machine connect your data to someone else's tomorrow.

Master data management: one definition for "the same thing"

Metadata gives a single value its meaning; master data does the same job for the entities that values refer to, and here the payoff is the most concrete in the chapter. Master data is the shared reference data that describes the core entities a business runs on: its materials, its equipment, its products, its analytical methods, its suppliers. Unlike transaction data (which records events, like "batch 4471 started"), master data records things that persist and are referenced everywhere.

The problem master data solves is identifier chaos. The same raw material might be "Glucose" in the manufacturing system, "Dextrose" in the lab system (a LIMS or ELN such as LabWare, Waters NuGenesis, or Labguru), and "GLC-001" in the inventory system. To a human these are obviously the same sugar; to software they are three unrelated strings, and any attempt to total up usage, trace a lot, or compare batches silently breaks. The same trap catches more complex items: a fermentation medium might be "CHO growth medium" in the MES, "Buffer A" on the bench, and "RAW-MAT-2847" in ERP: three names for one material, with no way to reconcile lot traceability. Master Data Management (MDM) is the discipline of maintaining a single, authoritative, governed definition of each such entity, and propagating it consistently to every system that uses it [1].

Master data management replaces three system-local names for one material with a single governed master record that every system points to. Figure by the authors.
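The three-names-for-one-sugar problem can be sketched as a cross-reference table that maps each system-local name to one master record. Everything below is illustrative: the master ID `MAT-000123`, the system labels, the quantities, and the `resolve` and `total_usage_kg` functions are assumptions made for the example, not a description of any MDM product.

```python
from collections import defaultdict

# Hypothetical master record and cross-reference table; IDs are illustrative.
MASTER = {"MAT-000123": {"name": "D-Glucose", "type": "raw material"}}
ALIASES = {
    ("MES", "Glucose"): "MAT-000123",
    ("LIMS", "Dextrose"): "MAT-000123",
    ("INVENTORY", "GLC-001"): "MAT-000123",
}


def resolve(system: str, local_name: str) -> str:
    """Map a system-local name to its governed master ID, or fail loudly."""
    try:
        return ALIASES[(system, local_name)]
    except KeyError:
        raise LookupError(
            f"No master record mapped for {local_name!r} in {system}"
        ) from None


def total_usage_kg(events) -> dict:
    """Sum usage per master ID across systems that use different local names."""
    totals = defaultdict(float)
    for system, local_name, kg in events:
        totals[resolve(system, local_name)] += kg
    return dict(totals)


events = [
    ("MES", "Glucose", 12.5),
    ("LIMS", "Dextrose", 0.5),
    ("INVENTORY", "GLC-001", 7.0),
]
print(total_usage_kg(events))  # {'MAT-000123': 20.0}
```

An unmapped name raises an error instead of being silently treated as a new material. That choice pushes the question back to the data steward, who decides whether it is a new entity or a fourth alias for an existing one.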

The international standard ISO 8000, originally a general industrial data-quality standard, provides a framework that the biopharmaceutical industry has adopted for exactly this: data-quality principles and a structure for master data and for exchanging it cleanly across organizations. ISO 8000 is a series whose master-data parts (the ISO 8000-100 series, including ISO 8000-110:2021) specify cross-organizational master-data exchange [9]. That cross-organizational reach matters, because a biopharmaceutical product is made by a web of partners (drug substance here, fill-finish there, testing somewhere else), and none of their data can be combined unless they first agree on what each material and method is.

Why it matters

For data management, governance is the difference between having data and being able to use it. The technical controls of the previous three chapters (validated systems, audit trails, access limits) are necessary but inert without the human layer that decides the rules and the definitions. Three roles assign accountability; six quality dimensions make "good" measurable; metadata makes numbers interpretable; and master data makes "the same thing" actually the same across every system. Skip the governance and you get the worst outcome in the field: a fast, well-validated pipeline that efficiently moves untrustworthy and unconnectable data. Every later ambition (analytics, a digital twin, regulatory submission by data rather than by document) rests on this foundation.

In the real world

Regulators have made governance a front-line expectation, not a nicety. The FDA's data-integrity guidance ties cGMP compliance to data being complete, consistent, and accurate, with reviewed audit trails [6]; the MHRA requires a documented data governance system sized to data criticality and risk [7]; and the WHO places ultimate responsibility on senior management [8]. On the standards side, ISO 8000 gives master-data work an internationally agreed footing [9], and FAIR has become the shared vocabulary for governing scientific data toward reuse [4]. A common failure shows what happens when that governance is missing: a lab introduces a new raw-material variant but the MES is never updated, so the steward quietly reconciles two spreadsheets by hand. The owner has no clear authority to mandate the MES change, and months pass before anyone notices that a batch record can no longer trace back to the true lot number. The fix is not more software; it is clearer accountability.

This is precisely the terrain of the U.S. NIIMBL institute and its Big Data Program / real-time lab-data and interoperability efforts: before instruments and partner organizations can share data in real time, they must agree on owners, definitions, and master records. Governance comes first, connection second. The hardest part of "connecting the data" turns out to be the human agreement that has to happen before a single byte moves.

Key terms

  • Data governance: the system of decision rights and accountabilities over data (who may do what, with which data, under what rules).
  • Data owner: the businessperson accountable for a data domain and its rules.
  • Data steward: the subject-matter expert responsible for defining meaning and fixing quality day to day.
  • Data custodian: the technical caretaker (usually IT) who runs storage, backups, and access.
  • Data quality dimension: a distinct, checkable property of data, such as accuracy or completeness.
  • Data integrity: data that is complete, consistent, and accurate across its lifecycle.
  • Data criticality: how much a data error would endanger the patient or product; used to focus quality effort.
  • Metadata: data about data; the context that makes a raw value interpretable and trustworthy.
  • FAIR: Findable, Accessible, Interoperable, Reusable; principles centered on rich machine-actionable metadata.
  • Master data: the persistent reference data describing core entities such as materials, equipment, products, and methods.
  • Master Data Management (MDM): maintaining a single governed definition of each entity and propagating it everywhere.
  • ISO 8000: the international standard for data quality and master data, including cross-organizational exchange.

Where this leads

We have now governed our data: it has owners, measurable quality, rich metadata, and master records that agree on what each thing is. You might expect that to be the end of the connection problem, but it is only the beginning. Even when two systems exchange data with flawless syntactic interoperability (the bytes parse, the fields line up), the numbers still may not connect, because the same real-world thing is described differently in different places: different units, different identifiers, different timestamps, different vocabularies. The next chapter, Why Numbers Don't Connect: The Semantic Interoperability Problem, names this heterogeneity head-on and shows why it is what ultimately drives the field toward ontologies and FAIR.