Data Governance, Data Quality, and Master Data

📍 Where we are: The last chapter proved that a single computer system is fit for purpose; this one builds the organizational backbone that keeps all of them trustworthy — the policies, roles, and definitions that make the integrity and validation controls earlier in this part actually stick.

The last chapter, From CSV to CSA, showed how to validate one system, and how FDA's 2025 Computer Software Assurance guidance for device production and quality-system software, which the drug and biologics industry has adopted by analogy, shifts that effort from stacks of paperwork toward critical thinking about real risk. But a plant runs many such systems: a manufacturing execution system (MES) such as Siemens Opcenter or Körber PAS-X that directs and records what happens on the production floor, a laboratory system that captures test results, and an enterprise resource planning (ERP) system that tracks materials, inventory, and orders. A dozen of them can each be individually validated and the plant can still drown in contradictory, unconnectable data, because nobody agreed on the rules that govern data across them. Closing that gap is the job of data governance.

The simple version

Think of a large library. Buying good shelves and a working checkout scanner (the technical systems) is not enough. Someone has to decide who is allowed to add books, how every book is labeled and shelved, what counts as "the same" book when three editions arrive, and who fixes a mis-catalogued title. Without those rules and those people, the building fills with books nobody can find. Data governance is the librarian's rulebook for an entire factory's data.

What this chapter covers

We start by defining data governance and its three custodial roles, then walk the dimensions of data quality that tell us whether data is any good, then the metadata that gives raw numbers meaning, then master data management — the shared definitions of materials, equipment, and products that prevent identifier chaos — and we dissect the anatomy of one governed master record before turning to what is still genuinely unsolved: harmonizing master data across supply-chain partners. The thread tying it all together: you cannot connect data you have not first governed.

Roles and quality: who owns the data and how good it is

Data governance: rules and the people who own them

Data governance is the exercise of authority and control over the management of data: the system of decision rights and accountabilities that says who can do what, with which data, under what rules [1]. It is not software; it is the organizational layer above software. The data-management body of knowledge (DAMA-DMBOK) treats these decision rights as distinct accountabilities [1], and research on the subject frames governance as a set of decision domains: principles (the high-level rules data must obey), quality (how good it has to be), metadata (the context attached to it), access (who may see or change it), and lifecycle (how it is created, kept, and retired). Each domain is paired with a clear locus of accountability, a named answer to "who decides?" [3].

In regulated biomanufacturing, governance is not optional housekeeping. The World Health Organization is explicit that senior management is responsible for an effective data governance system, embedded in the company's quality system, that applies the integrity principles across the whole data lifecycle [8]. The pharmaceutical engineering guidance from ISPE (the GAMP — Good Automated Manufacturing Practice — records and data integrity framework) similarly frames a data governance framework as the umbrella under which data integrity controls live [5]. Governance, in other words, is the connective tissue between the abstract demand "data must have integrity" and the concrete controls — audit trails, access limits, validation — that deliver it.

Three roles: owner, steward, custodian (and why confusing them fails)

Three roles share the work, and confusing them is a classic failure mode [1]. The failure is rarely that the work is skipped — it is that the wrong person is asked to do it. Hand the custodian a question of business meaning ("is this material the same as that one?") and you get a technically tidy answer that is biologically wrong; ask the owner to fix a field by hand and the fix never propagates to the systems IT controls. Each duty needs its own home:

The data owner is accountable — the businessperson who answers for a data domain (say, "all batch records") and sets the rules for it.
The data steward is responsible — the subject-matter expert who tends the data day to day, defines what each field means, and resolves quality problems.
The data custodian is the technical caretaker — typically IT — who runs the storage, backups, and access controls, but does not decide the business meaning of the data.

Governance flows from a management mandate, through policy, to three distinct roles — accountability, stewardship, and technical custody — that together produce trustworthy data. Original diagram by the authors, created with AI assistance.

note

A useful shorthand: the owner is accountable (their name is on it), the steward is responsible (they do the work), and the custodian keeps the keys (they run the infrastructure). One person can wear two hats in a small lab, but the three duties must each have a home.
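The split of duties can be sketched as a small decision-rights table. This is a minimal illustration rather than a prescribed model: the action names and the `may` and `unowned_actions` helpers are hypothetical, chosen only to show that each duty maps to a named role.

```python
from enum import Enum


class Role(Enum):
    OWNER = "owner"          # accountable: sets the rules for a data domain
    STEWARD = "steward"      # responsible: defines fields, resolves quality problems
    CUSTODIAN = "custodian"  # technical caretaker: storage, backups, access controls


# Hypothetical action names, used only to illustrate the split of duties.
DECISION_RIGHTS = {
    "set_domain_rules": {Role.OWNER},
    "define_field_meaning": {Role.STEWARD},
    "resolve_quality_issue": {Role.STEWARD},
    "run_backup": {Role.CUSTODIAN},
    "configure_access_control": {Role.CUSTODIAN},
}


def may(role, action):
    """Return True if the role holds the decision right for the action."""
    return role in DECISION_RIGHTS.get(action, set())


def unowned_actions():
    """List actions with no role assigned, i.e. duties that lack a home."""
    return [action for action, roles in DECISION_RIGHTS.items() if not roles]
```

Asking `may(Role.CUSTODIAN, "define_field_meaning")` returns `False`, which is the point of the table: a question of business meaning is routed to the steward, not to IT.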

The six dimensions of data quality

"Good data" is too vague to manage. The discipline breaks quality into measurable dimensions — distinct properties you can check one at a time. The landmark study of what data quality means to the people who actually use data grouped these into four families: intrinsic (is the data correct in itself?), contextual (is it right for the task at hand?), representational (is it presented clearly?), and accessibility (can you get at it?) [2]. The international data-quality standard, ISO 8000, similarly frames quality as the degree to which data meets stated requirements [9]. In bioprocessing the dimensions that bite hardest are:

Accuracy — does the value match reality? A pH reading of 7.2 must reflect the actual broth, not a drifted, uncalibrated probe.
Completeness — is anything missing? A batch record with a blank in-process result is not merely untidy; under cGMP (current Good Manufacturing Practice — the binding regulations a plant must follow to make a drug) it is a data-integrity defect [6]. U.S. regulation makes this concrete: 21 CFR 211.188 — a section of the U.S. Code of Federal Regulations, the legally binding rulebook for drug manufacturing — requires batch production and control records to document all in-process testing and results, so a blank field where a result belongs is a compliance violation, not a stylistic one.
Consistency — does the same fact agree across systems? If the process historian (such as the AVEVA PI System, formerly OSIsoft PI) says the batch finished at 14:03 and the manufacturing system says 15:03, at least one is wrong.
Timeliness — is the data available when it is needed, and recorded when the event happened? Integrity guidance calls this being contemporaneous — recorded at the time of the activity, not reconstructed later [6].
Uniqueness — is each real-world thing represented exactly once? Two records for the same material lot are a recipe for mixing the wrong components.
Validity — does the value obey its rules? A temperature of "−500 °C" or a date of "2026-13-40" is invalid on its face.

These dimensions are the quality face of data integrity, which regulators define as data that is complete, consistent, and accurate across its lifecycle [6]. The ALCOA+ attributes are the regulators' shorthand for the marks of a trustworthy record: Attributable, Legible, Contemporaneous, Original, Accurate, plus Complete, Consistent, Enduring, and Available. They overlap with these quality dimensions by design. ALCOA+ frames them as integrity requirements a record must satisfy, while the dimensions frame them as measurable properties you can score; Complete the attribute and Completeness the dimension are the same idea seen from two angles, not two different things. The GxP integrity guidance (GxP being the umbrella term for the "Good x Practice" regulations: manufacturing, laboratory, clinical, and the rest) frames the same idea through the ALCOA principles and stresses data criticality and risk assessment, which means spending the most quality effort where a wrong number would most endanger the patient or the product [7].
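Because each dimension is a separate, checkable property, the six can be scored one at a time. The sketch below does that for a single pH reading. It is illustrative only: the field names, the 0 to 14 validity rule, the 0.05 tolerance, and the five-minute delay limit are all assumptions, not values taken from any guidance.

```python
from datetime import datetime, timedelta

REQUIRED = ("value", "unit", "equipment_id", "event_time", "recorded_time")


def quality_findings(reading, other_system_value=None, reference_value=None,
                     seen_keys=frozenset(), tolerance=0.05,
                     max_delay=timedelta(minutes=5)):
    """Return the set of quality dimensions that a pH reading fails."""
    failed = set()

    # Completeness: every required field is present and non-empty.
    if any(reading.get(field) is None for field in REQUIRED):
        failed.add("completeness")
        return failed  # the remaining checks need the missing fields

    # Validity: the value obeys its rule (assumed here: pH on the 0-14 scale).
    if not 0.0 <= reading["value"] <= 14.0:
        failed.add("validity")

    # Accuracy: agreement with an independent reference, when one exists.
    if reference_value is not None and abs(reading["value"] - reference_value) > tolerance:
        failed.add("accuracy")

    # Consistency: the same fact agrees with a second system's copy.
    if other_system_value is not None and other_system_value != reading["value"]:
        failed.add("consistency")

    # Timeliness: recorded close to the event (contemporaneous).
    if reading["recorded_time"] - reading["event_time"] > max_delay:
        failed.add("timeliness")

    # Uniqueness: at most one record per (equipment, event time).
    if (reading["equipment_id"], reading["event_time"]) in seen_keys:
        failed.add("uniqueness")

    return failed
```

Returning the set of failed dimensions, rather than a single pass/fail flag, keeps the findings separable so that a steward can see which property broke.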

caution

Quality dimensions can conflict. Demanding more completeness (capture everything) can hurt timeliness (it takes longer), and chasing perfect accuracy on every trivial field wastes effort better spent on the critical ones. Good governance ranks data by criticality and applies the dimensions proportionately [7] — a risk-based stance that directly echoes the CSA mindset from the previous chapter. For example, a drifting temperature sensor in a mammalian cell-culture bioreactor directly affects product titer (the concentration of drug the cells produce) and patient safety (high criticality, warranting 100% review), while humidity logging in a storage room (low criticality) can rely on sampling-based checks.

Criticality-based data quality: risk-proportionate effort

The conflict above is not resolved by working harder; it is resolved by working selectively. Every governed record should carry a criticality rating that decides how much of each quality dimension to enforce. The same risk logic that the previous chapter applied to validating systems applies here to grading data: a record whose error could reach the patient earns the heaviest scrutiny (full audit-trail review, dual sign-off, 100% verification), while a record that only ever informs a convenience dashboard earns a light touch. This is why titer and operating data from the production bioreactor are governed differently from a storage-room humidity log even though both are "just sensor readings." Criticality is not a property of the number; it is a property of what the number decides, and only the data owner and steward, not the custodian, can set it. Release-critical analytical results, which originate in analytical and formulation development, sit at the top of that ranking, which is why their governance and their Part 11 controls (Part 11 being the FDA rule governing electronic records and signatures) are the strictest in the plant.
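One way to make risk-proportionate effort concrete is to let the criticality rating select a review policy. The sketch below is a hedged illustration: the `ReviewPolicy` fields, the 10% sampling fraction for low-criticality data, and the fallback to the strictest policy for unrated records are all assumptions introduced for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReviewPolicy:
    review_fraction: float    # share of records verified (1.0 means 100%)
    dual_sign_off: bool
    audit_trail_review: bool


# The 10% sampling fraction is an illustrative assumption, not a regulatory value.
POLICY_BY_CRITICALITY = {
    "High": ReviewPolicy(review_fraction=1.0, dual_sign_off=True, audit_trail_review=True),
    "Low": ReviewPolicy(review_fraction=0.1, dual_sign_off=False, audit_trail_review=False),
}


def set_criticality(record, level, set_by_role):
    """Return a copy of the record with its criticality set by an allowed role."""
    if set_by_role not in {"owner", "steward"}:
        raise PermissionError("only the data owner or steward may set criticality")
    if level not in POLICY_BY_CRITICALITY:
        raise ValueError(f"unknown criticality level: {level!r}")
    return {**record, "criticality_level": level}


def policy_for(record):
    """Look up the review policy; unrated records get the strictest one (assumption)."""
    level = record.get("criticality_level")
    return POLICY_BY_CRITICALITY.get(level, POLICY_BY_CRITICALITY["High"])
```

Defaulting unrated records to the strictest policy is a conservative design choice: a record nobody has graded is treated as critical until an owner or steward says otherwise.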

A downstream example: governing a purification record

The bioreactor is where governance is easiest to picture, but the criticality logic bites just as hard downstream, in purification — and there the unit operation itself decides which fields are high-criticality. Take Protein A capture (the affinity step that grabs the antibody out of the harvest by its Fc stem, described in capture chromatography). The step's governed record is not one number but a small panel: the dynamic binding capacity (DBC — how many grams of antibody one litre of resin holds at the running flow rate, typically 40–80 g/L), the host-cell-protein (HCP) clearance it achieves (2–3 log in a single pass), the leached-Protein A level (ligand that sheds off the beads, tracked to parts-per-million), and the operator's pooling-window cut points (which slice of the elution peak is kept as product). Every one of these is high-criticality, because each feeds directly into whether the lot is safe — and so each earns the heaviest quality treatment: 100% review, attributable sign-off, and a tamper-evident audit trail. The convenience-dashboard field on the same chromatography skid does not.

The same is true one step later, at low-pH viral inactivation (the acid hold that kills enveloped viruses, in viral inactivation). Its governed record is the cleanest illustration in the plant of why metadata, not the bare value, is what governance protects: a pH reading of 3.5 is meaningless on its own. The record carries the setpoint (SP — the value the control loop aimed for) and the process value (PV — the value the calibrated probe actually measured), because the proof of the hold lives in the measured trace, not the target; it carries the hold duration the stream actually spent with pH and temperature both in-window; and it carries a quality flag (Good, Uncertain, or Bad) the measuring system assigns from the probe's own health. A pH PV that drifts above the validated ceiling while the SP still reads the target is exactly the silent failure the accuracy and consistency dimensions exist to catch — and exactly why a governed record is a value-plus-metadata-plus-flag bundle, never a lone number. These downstream records are not a different governance discipline from the bioreactor's; they are the same six dimensions and the same criticality grading applied to a purification unit operation instead of an upstream one.
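The value-plus-metadata-plus-flag idea can be sketched as a hold-duration check that reads only the measured process values and their quality flags, never the setpoint. This is a minimal sketch under stated assumptions: the tuple layout of a sample, the function names, and any window limits passed in are hypothetical, and a real system would take its limits from the validated process description.

```python
def longest_hold_seconds(samples, ph_window, temp_window):
    """Longest continuous in-window span, judged on measured PVs only.

    samples: time-ordered (t_seconds, ph_pv, temp_pv, quality_flag) tuples.
    """
    ph_lo, ph_hi = ph_window
    temp_lo, temp_hi = temp_window
    best = 0.0
    start = None
    for t, ph_pv, temp_pv, flag in samples:
        in_window = (flag == "Good"
                     and ph_lo <= ph_pv <= ph_hi
                     and temp_lo <= temp_pv <= temp_hi)
        if in_window:
            if start is None:
                start = t          # a new in-window run begins here
            best = max(best, t - start)
        else:
            start = None           # an excursion or a non-Good flag breaks the run
    return best


def hold_passes(samples, ph_window, temp_window, required_seconds):
    """True if some continuous in-window run lasted at least the required time."""
    return longest_hold_seconds(samples, ph_window, temp_window) >= required_seconds
```

A sample whose flag is Uncertain or Bad breaks the run even when its pH looks fine, because a reading from an unhealthy probe cannot prove the hold.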

Meaning and identity: metadata and master data

Metadata: the context that makes data mean something

Back in Chapter 1, The Biologic and Its Data Shadow, we saw that a bare number — 37 — is meaningless without its context. Metadata is "data about data": the surrounding context that tells you the number is a temperature, in degrees Celsius, from bioreactor BR101, at 14:03:22, recorded by operator J. Lee. In a modern system that same reading is stored not as a lone 37 but as a structured object carrying its own context:

{
  "value": 37,
  "unit": "°C",
  "equipment_id": "BR101",
  "timestamp": "2026-06-14T14:03:22Z",
  "operator": "J. Lee",
  "sensor_id": "TEMP-001"
}

Without that surrounding context, the number cannot be interpreted, audited, or trusted. The integrity guidances are blunt that metadata is part of the record — data without its metadata is not complete data [6][7].

Governing metadata is therefore one of the named governance domains [3], and managing it well is what makes data findable and reusable rather than a write-only graveyard. The widely adopted FAIR principles — that data should be Findable, Accessible, Interoperable, and Reusable — put rich, machine-actionable metadata at their very center: data is only findable and reusable if it carries metadata a computer, not just a human, can read [4]. This is a FAIR touchpoint, and the hinge of the whole chapter: the Interoperable and Reusable halves of FAIR are unreachable without exactly the governed metadata and master data this chapter builds. Metadata governed today is what lets a machine connect your data to someone else's tomorrow.

Metadata as the link between raw values and meaning

Metadata is not decoration on top of the value; it is the join key — the shared value that links two records together — between a raw number and everything that gives it meaning. The equipment_id ties the reading to a governed equipment master record; the unit ties it to a governed unit-of-measure; the operator ties it to an ALCOA attributable identity. Strip the metadata and the number cannot be joined to anything — it becomes an orphan that no query can reach and no auditor can trust. Governed metadata, in other words, is what turns a pile of values into a graph of connected, interpretable facts. That is exactly why the integrity guidances treat metadata as part of the record, not an optional annex.
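The join-key role of metadata can be shown with the temperature reading from earlier. In this sketch the contents of `EQUIPMENT_MASTER` and `UNITS` are hypothetical stand-ins for governed reference tables, and `resolve` is an illustrative helper, not part of any particular system.

```python
# Hypothetical governed reference tables, reduced to one row each.
EQUIPMENT_MASTER = {"BR101": {"type": "bioreactor", "criticality_level": "High"}}
UNITS = {"°C": "degree Celsius"}


def resolve(reading):
    """Join a reading to its governed context; return None for an orphan."""
    equipment = EQUIPMENT_MASTER.get(reading.get("equipment_id"))
    unit_name = UNITS.get(reading.get("unit"))
    operator = reading.get("operator")
    if equipment is None or unit_name is None or not operator:
        return None  # no join key, so nothing downstream can reach this value
    return {**reading, "equipment": equipment, "unit_name": unit_name}


governed = resolve({
    "value": 37, "unit": "°C", "equipment_id": "BR101",
    "timestamp": "2026-06-14T14:03:22Z", "operator": "J. Lee",
    "sensor_id": "TEMP-001",
})
orphan = resolve({"value": 37})
```

Here `governed` carries the equipment record and unit definition alongside the value, while `orphan` is `None`: the same number, stripped of its metadata, joins to nothing.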

Master data management: one definition for "the same thing"

Metadata gives a single value its meaning; master data does the same job for the entities that values refer to — and here the payoff is the most concrete in the chapter. Master data is the shared reference data that describes the core entities a business runs on — its materials, its equipment, its products, its analytical methods, its suppliers. Unlike transaction data (which records events, like "batch 4471 started"), master data records things that persist and are referenced everywhere.

The problem master data solves is identifier chaos. The same raw material might be "Glucose" in the manufacturing system, "Dextrose" in the lab system (a laboratory information management system, or LIMS, or an electronic lab notebook, or ELN; examples include LabWare, Waters NuGenesis, and Labguru), and "GLC-001" in the inventory system. To a human these are obviously the same sugar; to software they are three unrelated strings, and any attempt to total up usage, trace a lot, or compare batches silently breaks. The same trap catches more complex items: a cell-culture medium might be "CHO growth medium" in the MES, "CD-CHO base medium" on the bench, and "RAW-MAT-2847" in ERP. Those are three names for one material, with no way to reconcile lot traceability. Master Data Management (MDM) is the discipline of maintaining a single, authoritative, governed definition of each such entity, and propagating it consistently to every system that uses it [1].

Before and after Master Data Management: on the left the same material is called Glucose in MES, Dextrose in LIMS, and GLC-001 in ERP — three disconnected names; on the right, one governed master record MAT-00042 that MES, LIMS, and ERP all reference. Master data management replaces three system-local names for one material with a single governed master record that every system points to. Original diagram by the authors, created with AI assistance.
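The reconciliation in the figure can be sketched as a lookup from each system-local name to the one master identifier. The alias table below reuses the names from the text; the `canonical_id` and `total_usage` helpers and the usage quantities are hypothetical.

```python
# Aliases are keyed by (system, local name) because a name is only
# meaningful inside the system that issued it.
ALIASES = {
    ("MES", "Glucose"): "MAT-00042",
    ("LIMS", "Dextrose"): "MAT-00042",
    ("ERP", "GLC-001"): "MAT-00042",
}


def canonical_id(system, local_name):
    """Resolve a system-local name to its governed master identifier."""
    try:
        return ALIASES[(system, local_name)]
    except KeyError:
        raise LookupError(f"no governed master record for {local_name!r} in {system}")


def total_usage(events):
    """Sum usage per master id from (system, local_name, kilograms) events."""
    totals = {}
    for system, local_name, kilograms in events:
        master_id = canonical_id(system, local_name)
        totals[master_id] = totals.get(master_id, 0.0) + kilograms
    return totals
```

With the aliases governed onto one identifier, usage reported under three different names totals correctly; an unknown name raises an error instead of silently creating a fourth material.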

The international standard ISO 8000, originally a general industrial data-quality standard, provides a framework that the biopharmaceutical industry has adopted for exactly this: data-quality principles and a structure for master data and for exchanging it cleanly across organizations. ISO 8000 is a multi-part series; its master-data parts (the ISO 8000-100 series, including ISO 8000-110:2021) specify cross-organizational master-data exchange [9][10]. That cross-organizational reach matters, because a biopharmaceutical product is made by a web of partners (drug substance here, fill-finish there, testing somewhere else), and none of their data can be combined unless they first agree on what each material and method is.

Anatomy of a governed master record

So far MDM has been an idea; this is what it looks like as a row of data. Below is the anatomy of a single governed master record — a row in a gov.material_master table — for the very "Glucose" that earlier had three names. Note its criticality_level reads High — because a raw material that physically enters the product drives titer and patient safety, unlike a convenience-dashboard field. Read the rest as an identity card: an indigo header naming the record, a green core block holding the system_aliases that map MES, LIMS, and ERP names onto this one entity, and a violet panel of the typed relationships that let a lot or a batch resolve back to it.

The data-point at the heart of governance: one governed master record, with the three system-local aliases it reconciles, its owner and steward, its ALCOA lineage, and the typed links that make it the single source of truth. Original diagram by the authors, created with AI assistance.

Notice how the record carries the whole chapter inside it. The criticality_level field is the risk-proportionate dimension; data_owner and data_steward are two of the three roles; created_by and modified_ts are the ALCOA lineage; system_aliases is the master-data reconciliation; and audit_log_link is the integrity hook to Part 11. In the open-source implementation, the entities this record governs become concrete rows: batches and their lineage live in the batch and equipment model as s88.batch and s88.genealogy, the naming-governance side — keeping each alias mapped to the one canonical name — is the gov.tag_dictionary in the unified namespace and naming layer, and the material-master row itself is the reference-data concept those tables presuppose. The diagram here is the concept; those tables are the code.

The master record as a constrained triple

That "graph of connected facts" is not just a metaphor; it is the subject of the next chapter. Once each governed record is a node and each system alias an edge, the master record can be written as RDF (the Resource Description Framework, the standard model for expressing data as subject-predicate-object triples), and the governance rule that makes it trustworthy can be written as a machine-checkable constraint. A canonical identity and its system aliases become a handful of triples in Turtle (RDF's text syntax), where gov: is our governance namespace and a reads "is a":

# one governed material master as RDF: a canonical identity reconciling three system aliases
@prefix gov:  <https://example.org/gov#> .    # placeholder IRI (assumption); substitute the real governance namespace
@prefix prov: <http://www.w3.org/ns/prov#> .  # W3C PROV-O namespace

gov:MAT-00042 a gov:MaterialMaster ;
    gov:entityName      "Glucose" ;
    gov:criticality     "High" ;
    gov:systemAlias     "Glucose" , "Dextrose" , "GLC-001" ;  # MES, LIMS, ERP names
    gov:dataOwner       gov:role-MaterialsOwner ;
    prov:wasGeneratedBy gov:reconciliation-2026-03 .          # PROV-O provenance of the merge

The uniqueness dimension ("each real-world thing is represented exactly once") is then no longer a hope but a checkable rule. In SHACL (the Shapes Constraint Language, the W3C standard for validating that graph data has the required structure), a shape can require that a master record carry exactly one canonical name and at least one owner, so that a loader can refuse any record that fails validation. This is the same closed-world gate the ontology book builds for the release specification, where a missing required result is a failure now rather than an open question; see the release gate and SHACL. The cross-system reconciliation also poses a natural competency question (a question the data must be able to answer, and the unit of specification in an executable ORSD, or ontology requirements specification document): "which system aliases resolve to one canonical material, and who owns it?" That question is answerable in one SPARQL query (SPARQL being the query language for RDF) precisely because the aliases were governed onto a single node first.

Two further semantic threads run straight out of this record. Its wasGeneratedBy edge is PROV-O provenance (PROV-O being the W3C ontology for who or what produced a record), the formal version of the created_by/modified_ts ALCOA lineage and the subject of relations and genealogy. And the kind of thing the record names matters: a material is a continuant (an entity that persists through time and bears qualities) while the reconciliation that produced it is an occurrent (a process that happens and is over). That upper-ontology distinction, from classes and taxonomy, keeps a material master from ever being confused with the event that minted it. Governed metadata today is what makes all of this expressible tomorrow; the semantic interoperability chapter picks up exactly this handoff.
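The uniqueness rule and the competency question can be previewed without any RDF tooling. The sketch below is a plain-Python stand-in, not SHACL or SPARQL: it stores the triples as tuples and hand-codes the checks that a SHACL shape and a SPARQL query would express declaratively. The helper names are hypothetical.

```python
TRIPLES = [
    ("gov:MAT-00042", "rdf:type", "gov:MaterialMaster"),
    ("gov:MAT-00042", "gov:entityName", "Glucose"),
    ("gov:MAT-00042", "gov:systemAlias", "Glucose"),
    ("gov:MAT-00042", "gov:systemAlias", "Dextrose"),
    ("gov:MAT-00042", "gov:systemAlias", "GLC-001"),
    ("gov:MAT-00042", "gov:dataOwner", "gov:role-MaterialsOwner"),
]


def objects(triples, subject, predicate):
    """All objects of the triples matching a subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def violations(triples):
    """Closed-world check: each master needs exactly one name and at least one owner."""
    problems = []
    masters = [s for s, p, o in triples
               if p == "rdf:type" and o == "gov:MaterialMaster"]
    for master in masters:
        if len(objects(triples, master, "gov:entityName")) != 1:
            problems.append((master, "exactly one gov:entityName required"))
        if not objects(triples, master, "gov:dataOwner"):
            problems.append((master, "at least one gov:dataOwner required"))
    return problems


def aliases_and_owners(triples, master):
    """Answer: which aliases resolve to this material, and who owns it?"""
    return (sorted(objects(triples, master, "gov:systemAlias")),
            objects(triples, master, "gov:dataOwner"))
```

For the record above, `violations(TRIPLES)` is empty; remove the owner triple and the same call reports a failure instead of treating the missing owner as merely unknown, which is the closed-world behavior the text describes.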

Putting it together: governance in a multi-system plant

A master record is not created once and frozen; it lives a governed lifecycle. The five stages below trace that loop for our Glucose record. The owner and steward define it; each plant system captures its own local alias; the steward maintains the fields and fixes defects; MDM reconciles the aliases onto one authoritative record; and the result is published to every system. Crucially, each published version is pinned to an effective date and superseded rather than overwritten, so a batch record from last March still resolves to the master definition that was in force last March, not today's.

The governed lifecycle of a master record: define, capture, steward, reconcile, publish — with every version effective-dated and superseded rather than overwritten. Original diagram by the authors, created with AI assistance.
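Effective-dating can be sketched as an append-only list of versions with an as-of lookup. This is a minimal illustration: the `VersionedMaster` class, its method names, and the `grade` field in the usage example are hypothetical.

```python
from bisect import bisect_right
from datetime import date


class VersionedMaster:
    """Append-only history of one master record, keyed by effective date."""

    def __init__(self):
        self._dates = []     # effective dates, strictly increasing
        self._versions = []  # definition in force from the matching date

    def publish(self, effective, definition):
        """Add a version that supersedes, but never overwrites, earlier ones."""
        if self._dates and effective <= self._dates[-1]:
            raise ValueError("a new version must take effect after the current one")
        self._dates.append(effective)
        self._versions.append(dict(definition))

    def as_of(self, when):
        """Return the definition that was in force on the given date."""
        index = bisect_right(self._dates, when)
        if index == 0:
            raise LookupError("no version was in force on that date")
        return self._versions[index - 1]


glucose = VersionedMaster()
glucose.publish(date(2025, 1, 1), {"entity_name": "Glucose", "grade": "A"})
glucose.publish(date(2025, 9, 1), {"entity_name": "Glucose", "grade": "B"})
march_definition = glucose.as_of(date(2025, 3, 15))
```

A batch record dated 15 March 2025 resolves to the first version even after the second has been published, because `publish` appends and `as_of` searches by date rather than returning the latest entry.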

Day to day, that lifecycle resolves into a single recurring decision: when a new record arrives, does it match an already-approved master record? The figure below traces that loop. If it matches, the system approves and propagates it automatically; if not, the data owner, not the custodian, decides whether to add it as a new governed entity or reject it. An approved or newly added record is timestamped and synced out to the MES, LIMS, and ERP; a rejected one is logged with its reason. The owner, steward, and custodian role chips show who is allowed to make which call.

The day-to-day governance decision: match an arriving record to the approved master, then approve-and-propagate or let the owner add-or-reject — synced to MES, LIMS, and ERP or logged. Original diagram by the authors, created with AI assistance.
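The decision loop can be sketched as one function that returns a timestamped outcome. The function name, the outcome labels, the `awaiting_owner` state for a record the owner has not yet ruled on, and the `MAT-00043` identifier in the usage example are assumptions added for illustration.

```python
from datetime import datetime, timezone

SYSTEMS = ("MES", "LIMS", "ERP")


def decide(record_name, approved, owner_decision=None, now=None):
    """Route an arriving record through the match / approve / owner-decides flow.

    approved maps known names to master ids; owner_decision is None,
    ("add", new_master_id) or ("reject", reason).
    """
    stamp = now or datetime.now(timezone.utc)
    if record_name in approved:
        # Match: approve and propagate automatically.
        return {"outcome": "approved", "master_id": approved[record_name],
                "sync_to": SYSTEMS, "at": stamp}
    if owner_decision is None:
        # No match and no ruling yet: wait for the data owner.
        return {"outcome": "awaiting_owner", "sync_to": (), "at": stamp}
    action, detail = owner_decision
    if action == "add":
        approved[record_name] = detail
        return {"outcome": "added", "master_id": detail,
                "sync_to": SYSTEMS, "at": stamp}
    if action == "reject":
        return {"outcome": "rejected", "reason": detail,
                "sync_to": (), "at": stamp}
    raise ValueError(f"unknown owner action: {action!r}")
```

For example, `decide("Sucrose", approved, ("add", "MAT-00043"))` registers a new governed entity and marks it for sync to all three systems, while a rejection carries its reason and syncs nowhere.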

This is why governance is the prerequisite for connection rather than a sequel to it. The technical controls of Part 11 and Annex 11 protect a record once it exists; the lifecycle and decision flow above decide what the record is and keep every system pointing at the same version of it. Skip them and the most carefully validated audit trail still records changes to a definition nobody agrees on.

The cross-organizational challenge: master data nobody owns

Inside one company, MDM is a solved discipline: pick an owner, build the master table, publish it. For finished products and serialized items, the industry even has cross-organizational identifiers already — GS1's GTIN (a global trade item number, identifying a product type), GLN (a global location number, identifying a place or party), and SSCC (a serial shipping container code, identifying a specific shipped pallet or carton), mandated by the Drug Supply Chain Security Act (DSCSA) in the US and the Falsified Medicines Directive in the EU — so traceability of the saleable unit is largely solved [12]. What remains unsolved is master data for the process: raw-material grades, analytical methods, and especially each partner's own criticality rankings, which GS1 does not cover. That genuinely unsolved problem appears at the company boundary. A biopharmaceutical product is rarely made by one organization — a contract development and manufacturing organization (CDMO) makes the drug substance (the purified active ingredient, before it is formulated into its final form), a fill-finish partner puts it into vials, a contract lab releases it, and a distributor ships it. Each of those partners runs its own material master, its own identifiers, and — worse — its own criticality rankings. There is no shared owner with the authority to mandate a single definition across legal entities, so the reconciliation that MDM automates inside a plant collapses back into manual spreadsheet matching between plants. Today, harmonizing "is your RAW-MAT-2847 the same as our MAT-00042?" across a supply chain is still largely a human, email-and-spreadsheet exercise — and that fragility is exactly where lot-traceability gaps and supplier-data discrepancies originate [11].

Three threads are trying to close this gap, none of them finished:

A cross-organizational standard. ISO 8000-110:2021 specifies how to exchange master data with explicit, machine-checkable requirements for syntax, semantic encoding, and conformance to a data specification — precisely so that two organizations can agree on what a material is without a human in the loop [10]. (Data provenance is the subject of a separate part, ISO 8000-120.) It defines the format of agreement; it cannot force partners to adopt it.
Shared, neutral infrastructure. Industry consortia have piloted decentralized, blockchain-style ledgers — shared record-books that every partner can read and append to but no single partner controls — so that supply-chain partners can reference a shared identity for a product or material without any one party owning the database. The EU PharmaLedger project (2020–2022), an IMI/Horizon 2020 effort, is the best-known; it has since spun out into the member-funded PharmaLedger Association, though adoption as production cross-org infrastructure is still early (the authors' assessment).
The semantic layer. Even a shared identifier does not guarantee shared meaning — which is why this problem hands directly to the next chapter on semantic interoperability.

The honest status: within four walls, governance works; across the supply chain, the master data nobody owns is one of the field's real open problems.

Why it matters

For data management, governance is the difference between having data and being able to use it. The technical controls of the previous three chapters — validated systems, audit trails, access limits — are necessary but inert without the human layer that decides the rules and the definitions. Three roles assign accountability; six quality dimensions make "good" measurable; metadata makes numbers interpretable; and master data makes "the same thing" actually the same across every system. Skip the governance and you get the worst outcome in the field: a fast, well-validated pipeline that efficiently moves untrustworthy and unconnectable data. Every later ambition — analytics, a digital twin, regulatory submission by data rather than by document — rests on this foundation.

Governance is the substrate a model stands on

Nowhere does that dependence bite harder than in machine learning, where the most expensive failures trace straight back to ungoverned data. A model is only as trustworthy as the labels and the lineage it learned from, and three governance gaps each produce a specific, well-known modeling disaster. First, without governed batch identity — the very thing master data secures — a model cannot honor the batch-grouped split that bioprocess validation requires: rows from one batch must fall entirely on the train or the test side, never both, or near-duplicate within-batch neighbors leak (the train and test sets share information) and inflate the reported score into a number that collapses on the next real batch. The same governed identity is what lets cross-validation group by batch — GroupKFold or leave-one-batch-out — instead of shuffling rows, and the models-and-validation chapter shows exactly that failure. Second, the governed operating range on a master record — the qualified window a value is valid over — is what a model's applicability domain check reads at inference: a governed record can flag a spectrum or a setting that lies outside the envelope the model was trained on, before its number is trusted, whereas an ungoverned one cannot. Third, process drift (the living process genuinely moving — a new cell-line passage, a media-lot change) is invisible without governed, version-pinned master data to compare against; the effective-date pinning above is precisely what lets a drift monitor tell "the process changed" from "the master definition changed under me." And the model itself becomes a governed entity: the lineage edges in MLOps and lifecycle — trained-on which pinned dataset, validated under which plan, supersedes which version — are the same owner/steward/version-pinning discipline this chapter applies to a material, now applied to a model. Govern the data and you can govern the model; skip it and the model is a confident guess on data nobody agreed on.
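The batch-grouped split depends only on a trustworthy batch identifier per row, which is what governed master data provides. The sketch below implements leave-one-batch-out in plain Python so that it runs without dependencies; in practice scikit-learn's `GroupKFold` and `LeaveOneGroupOut` do the same job when given the batch ids as `groups`. The function name and the example ids are hypothetical.

```python
def leave_one_batch_out(batch_ids):
    """Yield (held_out_batch, train_indices, test_indices), one fold per batch.

    Every row of a batch lands on the same side of the split, so
    near-duplicate within-batch neighbors cannot leak across it.
    """
    for held_out in sorted(set(batch_ids)):
        test = [i for i, batch in enumerate(batch_ids) if batch == held_out]
        train = [i for i, batch in enumerate(batch_ids) if batch != held_out]
        yield held_out, train, test


batch_ids = ["B1", "B1", "B2", "B2", "B2", "B3"]
folds = list(leave_one_batch_out(batch_ids))
```

Each fold holds out one whole batch. If the batch ids were ungoverned, for instance the same batch recorded under two names, rows from one real batch would land on both sides and the reported score would be inflated.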

This is also why governance and validation are two halves of one control, not separate chores. The previous chapter, From CSV to CSA, traced the move from CSV (computerized system validation: exhaustive, document-everything proof) to CSA (computer software assurance: risk-based critical thinking). That is the same risk-proportionate logic the criticality grading above applies to data: a critical record earns the heavy treatment a critical function earns under IQ/OQ/PQ (installation, operational, and performance qualification, the rungs that prove a system installs, operates, and performs as specified), while a trivial one earns a lighter check. Governance supplies the inputs those controls need: the ALCOA+ attributes (the regulators' marks of a trustworthy record: attributable, legible, contemporaneous, original, accurate, plus complete, consistent, enduring, available) are what the data_owner, created_by, and modified_ts fields make true of every governed record, and the Part 11 / Annex 11 electronic-records controls (21 CFR Part 11 in the US, EU GMP Annex 11) are what the audit_log_link reaches into. Validation proves the system is trustworthy; governance decides what the records mean and who owns them. A model, a release, or a submission needs both.

In the real world

Regulators have made governance a front-line expectation, not a nicety. The FDA's data-integrity guidance ties cGMP compliance to data being complete, consistent, and accurate, with reviewed audit trails [6]; the MHRA requires a documented data governance system sized to data criticality and risk [7]; and the WHO places ultimate responsibility on senior management [8]. On the standards side, ISO 8000 gives master-data work an internationally agreed footing [9], and FAIR has become the shared vocabulary for governing scientific data toward reuse [4]. A common failure shows what happens when that governance is missing: a lab introduces a new raw-material variant but the MES is never updated, so the steward quietly reconciles two spreadsheets by hand. The owner has no clear authority to mandate the MES change, and months pass before anyone notices that a batch record can no longer trace back to the true lot number. The fix is not more software; it is clearer accountability. The physical manufacturing world calls this same loss of traceability a quality event; the quality, regulatory, and data backbone of the manufacturing book exists precisely to prevent it, and governance is its data-side counterpart.

This is precisely the terrain of connectivity standards like SiLA 2 and the Allotrope Framework (two ways for lab instruments to expose their data in a common, agreed format), and OPC UA (the equivalent standard for equipment on the plant floor): before instruments and partner organizations can share data in real time, they must first agree on owners, definitions, and master records — governance first, connection second. The hardest part of "connecting the data" turns out to be the human agreement that has to happen before a single byte moves.

Key terms

Data governance — the system of decision rights and accountabilities over data: who may do what, with which data, under what rules.
Data owner — the businessperson accountable for a data domain and its rules.
Data steward — the subject-matter expert responsible for defining meaning and fixing quality day to day.
Data custodian — the technical caretaker (usually IT) who runs storage, backups, and access.
Data quality dimension — a distinct, checkable property of data, such as accuracy or completeness.
Data integrity — data that is complete, consistent, and accurate across its lifecycle.
Data criticality — how much a data error would endanger the patient or product; used to focus quality effort.
Metadata — data about data; the context that makes a raw value interpretable and trustworthy.
FAIR — Findable, Accessible, Interoperable, Reusable; principles centered on rich machine-actionable metadata.
Master data — the persistent reference data describing core entities: materials, equipment, products, methods.
Master Data Management (MDM) — maintaining a single governed definition of each entity and propagating it everywhere.
ISO 8000 — the international standard for data quality and master data, including cross-organizational exchange.
System alias — a system-local name for an entity (e.g. the MES, LIMS, and ERP names for one material) that a master record reconciles to a single canonical identity.
Effective date / version pinning — recording which version of a master definition was in force when, so historical records resolve to the definition that applied at the time rather than the latest one.
Cross-organizational master data — shared definitions of materials, methods, and products across supply-chain partners who each maintain their own identifiers; today's hardest, still-open governance problem.
Setpoint (SP) vs. process value (PV) — the value a control loop aimed for versus the value a calibrated probe actually measured; a governed record keeps both, because the proof lives in the measured PV trace.
Quality flag — a data-quality verdict (Good, Uncertain, or Bad) the measuring system attaches to a value so it is never read in isolation.
RDF triple — a subject-predicate-object fact; the standard way to express a governed record as a node-and-edge graph that a query can traverse.
SHACL shape — a machine-checkable constraint that gates whether graph data (such as a master record) meets a required structure, closed-world, before it is accepted.
Competency question — a question the governed data must be able to answer (such as "which aliases resolve to one canonical material?"), used to specify what the data model must support.
PROV-O provenance — the W3C ontology recording who or what produced a record; the formal version of a master record's created-by and modified-timestamp lineage.
Batch-grouped split — keeping every row of one batch entirely on the train or test side of a model evaluation, so near-duplicate within-batch rows do not leak and inflate the score; depends on governed batch identity.
Applicability domain — the qualified envelope a model is valid over; a governed operating range lets a model flag inputs that fall outside it before its prediction is trusted.
Process drift vs. data drift — the living process genuinely changing (a new passage or media lot) versus the master definition changing; version-pinned master data is what lets a drift monitor tell the two apart.
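
The effective date / version pinning entry above can be made concrete with a small lookup. This is a sketch under stated assumptions: the version history, its effective dates, and the pH ranges are invented for illustration, and version_in_force is a hypothetical helper. It resolves a historical record to the definition that applied on the record's date, not the latest one.

```python
from bisect import bisect_right
from datetime import date

# Hypothetical version history of one master definition,
# sorted by the date each version came into force.
VERSIONS = [
    (date(2023, 1, 1), "v1", {"ph_range": (6.8, 7.2)}),
    (date(2024, 6, 1), "v2", {"ph_range": (6.9, 7.3)}),
]


def version_in_force(versions, on_date):
    """Return the (effective_date, version, definition) valid on on_date."""
    effective_dates = [entry[0] for entry in versions]
    # Index of the first version that takes effect after on_date.
    pos = bisect_right(effective_dates, on_date)
    if pos == 0:
        raise LookupError(f"no master definition in force on {on_date}")
    return versions[pos - 1]


# A batch record from March 2024 resolves to v1, not the current v2.
_, version, definition = version_in_force(VERSIONS, date(2024, 3, 15))
print(version, definition["ph_range"])
```

Because each record resolves to its own pinned version, a drift monitor can tell whether a shifted value reflects a change in the process or a change in the definition it is being compared against.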

Where this leads

We have now governed our data: it has owners, measurable quality, rich metadata, and master records that agree on what each thing is. You might expect that to be the end of the connection problem — but it is only the beginning. Even when two systems exchange data with flawless syntactic interoperability (the bytes parse, the fields line up), the numbers still may not connect, because the same real-world thing is described differently in different places — different units, different identifiers, different timestamps, different vocabularies. The next chapter, Why Numbers Don't Connect: The Semantic Interoperability Problem, names this heterogeneity head-on and shows why it is what ultimately drives the field toward ontologies — shared, formal vocabularies that pin down what each term means, not just how it is spelled — and FAIR.
