Identifiers and Units: IRIs, QUDT, and the Typed Value

📍 Where we are: Part IV · Formalization — the phase that makes the model bite. We can author classes, relations, and axioms. This chapter examines the two things every one of those quietly assumed: that a name means the same thing everywhere, and that a number never travels without its unit.

The last two chapters built a vocabulary and gave it logic. But every triple we wrote (an RDF statement of the form subject — predicate — object, e.g. PApool-001 derivedFrom CLAR-001) leaned on two unstated promises. When we said PApool-001 derivedFrom CLAR-001 — and through the chain, that the capture pool traces back to BATCH-2026-001 — we assumed each name picks out one thing, the same one to every system that reads it. And when we said monomerPct 98.611, we assumed 98.611 carries enough with it to be unambiguous. Neither promise holds for free — and the gap between them and reality is exactly the data "swamp" the semantic-interoperability chapter named: the state where the same facts live in many systems under clashing names and bare numbers, so nothing can be reliably joined. This chapter closes both gaps: the identifier that makes a name global, and the unit-and-datatype discipline that makes a value self-describing.

The simple version

Two small habits separate a number you can trust from one you cannot. The first is a passport: instead of calling something "BR-101" — a name that means one thing in your plant and something else in mine — you give it a globally unique identifier that cannot be confused with anyone else's, anywhere. The second is always writing the currency: "5" is useless; "5 USD" you can bank. A value of 98.611 is a rumor; 98.611, typed as a number and tagged with the unit percent, is a fact. This chapter is about giving every name a passport and every number its currency.

What this chapter covers

We unpack the IRI — why a global identifier beats a local primary key — and the discipline of minting and reconciling them, including how owl:sameAs links one real batch named four different ways and why that link is more dangerous than it looks. We then give values their identity: the datatype that makes 98.611 a number rather than text, and QUDT units that pin what the number measures. We dissect one fully-identified value, show how this is precisely what makes data Findable and Interoperable, and close on the genuinely unsolved problem of identity reconciliation across legacy systems.

Throughout, a prefix followed by a colon marks which vocabulary a name comes from: bp: is our own campaign, while owl:, skos:, prov:, xsd:, and qudt: are shared standard vocabularies, each defined in the Key terms section at the end of this chapter.

The IRI: a global name, not a local key

In a relational database, BATCH-2026-001 is a primary key: unique within that table, meaningless outside it. The moment the lab system, the ERP (the plant's business and inventory system), and the historian (a time-series database that logs sensor readings) each hold their own "BATCH-2026-001," you have three keys that look identical and may not mean the same batch — or one batch with four different keys. RDF's answer is the IRI, an Internationalized Resource Identifier: a globally unique web name, built from a namespace prefix (the part before the colon, marking who owns the name) and a local id (the part after it), that resolves to one thing and one thing only [2]. Our bp:BATCH-2026-001 — namespace prefix bp:, local id BATCH-2026-001 — expands to https://example.org/bioproc#BATCH-2026-001, which cannot collide with another organization's batch of the same local name because the namespace makes it global. The subject, the predicate, and the resource-valued object of every triple are IRIs — so identity is built into the data model, not bolted on.

This is the structural form of the very first letter of the FAIR principles — the widely-adopted standard that data be Findable, Accessible, Interoperable, and Reusable — specifically F1, "data and metadata are assigned a globally unique and persistent identifier" [1]. A local key cannot be findable across systems; a persistent IRI can. Minting them well is its own small discipline — prefer opaque, stable ids over meaningful ones that rot when the meaning changes, never reuse a retired identifier, and decide up front who guarantees the namespace keeps resolving, because in a regulated setting that identity may need to resolve for the decades a record must be retained.
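The expansion from a prefixed name to a full IRI described above can be sketched in a few lines. This is a minimal illustration, assuming a hand-written prefix table; `PREFIXES` and `expand` are hypothetical names, not part of any RDF library, and only the bp: namespace is taken from this chapter.

```python
# Hypothetical prefix table: only the bp: namespace comes from this chapter.
PREFIXES = {
    "bp": "https://example.org/bioproc#",
}


def expand(prefixed_name: str) -> str:
    """Expand 'prefix:local' into a full IRI using PREFIXES."""
    prefix, sep, local = prefixed_name.partition(":")
    if not sep or prefix not in PREFIXES:
        raise ValueError(f"unknown or missing prefix in {prefixed_name!r}")
    return PREFIXES[prefix] + local


print(expand("bp:BATCH-2026-001"))
# https://example.org/bioproc#BATCH-2026-001
```

A bare local id such as BATCH-2026-001 is rejected rather than guessed at, which mirrors the point of the section: without a namespace the name is only a local key.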

One batch, named four ways: reconciliation and its hazard

The hard case is not naming a new thing; it is discovering that four systems have already named the same thing differently. The bioreactor DCS (distributed control system, the plant's process-control computer) logs BR101.Temp.PV; the LIMS (laboratory information management system, the lab's sample-and-result manager) calls the lot DS-2026-001; the ERP calls the material 1000457; a certificate of analysis (CofA) PDF says "Lot 26-001." Each is internally consistent and mutually unintelligible, which is the precise problem the open-source knowledge-graph chapter built its graph to solve. The graph is where a model that reconciles them can finally live. (The running example introduces each of these plant systems, the DCS and historian, the LIMS, the ERP, and the MES, in the four source systems, and the data book's Plant Information Systems chapter defines them in full.)

The blunt tool is owl:sameAs, which asserts that two IRIs denote the identical individual, so everything true of one is true of the other. Write lims:DS-2026-001 owl:sameAs bp:DS-001 and a reasoner (the inference software that automatically derives new facts from the ones you state) fuses their facts. It is powerful and it is a loaded gun: owl:sameAs propagates every property in both directions, so a single over-eager identity link can contaminate a graph with false inferences — the well-documented "sameAs problem," where Linked Data practitioners found the relation routinely misused to mean "similar to" or "related to" rather than true identity [5]. It collapses two IRIs into one individual with identical everything, which is exactly wrong for living-thing identity, for cross-system records that must each keep their own provenance, and for version identity — so it is the primitive you must understand mainly in order to know when not to reach for it. The safer instruments for looser links are weaker on purpose: skos:exactMatch and skos:closeMatch record that two terms line up without claiming they are the same entity, so a bad match degrades a mapping instead of corrupting the whole graph.

And here is the decision this book actually made: the companion graph asserts no owl:sameAs at all (grep -rn 'sameAs' *.ttl finds none). Two systems once described BATCH-2026-001 differently. The batch register (an MES, a manufacturing execution system) said it is a Batch, while a genealogy loader (the ETL job that imports batch-lineage records) recorded that its run used "a bioreactor." A naive union (simply pooling both systems' triples without curation) turned that second record into a second rdf:type (RDF's "is-a" membership statement) on one node, so a single IRI was asserted to be both a Batch and a Bioreactor at once, collapsing the material and the vessel into one thing. The fix was not to merge IRIs but to keep each source's assertion as a PROV claim and resolve them with a curation step:

# instances.ttl — reconciling two source systems WITHOUT an owl:sameAs over-merge.
bp:claim-batch-001  a prov:Entity ; prov:wasAttributedTo bp:BatchRegister .   # MES: it is a Batch (material)
bp:claim-vessel-001 a prov:Entity ; prov:wasAttributedTo bp:GenealogyLoad .   # ETL: the run used a bioreactor -> BR-101
bp:reconciliation-001 a prov:Activity ;                                       # the steward's curation decision
    prov:used bp:claim-batch-001 , bp:claim-vessel-001 ;
    prov:wasAssociatedWith bp:DataSteward .                                   # separates vessel BR-101 from the batch

(In the Turtle above, a is shorthand for rdf:type — "is a" — and the ; reuses the same subject for the next predicate.) Each claim stays attributed to the system that made it, the steward's prov:Activity records who decided and on what evidence, and the batch material and the vessel BR-101 stay two nodes — which is exactly what owl:sameAs would have destroyed. CQ-15 — one of the 23 numbered competency questions (the executable acceptance tests this book holds the model to) — is the acceptance test for this discipline: "Do two source-system records reconcile to one curated decision (PROV) without an owl:sameAs over-merge?" For a genuinely safe cross-system match — where you want to record that two terms line up but keep both records — use skos:exactMatch or a curated claim, not owl:sameAs. Choosing the right strength of link is one of the most consequential — and most underestimated — decisions in building a real bioprocess graph, and it is one triple either way (the source prefixes here are illustrative):

# A safe cross-system MATCH — terms line up, both records survive, no identity over-merge.
erp:material-1000457 skos:exactMatch bp:DS-001 .   # SAFE — terms line up, no identity claim

Four systems' names reconciled onto a single global IRI, bp:DS-001 — with owl:sameAs shown as the strong link that fuses all facts and skos:exactMatch as the safer match that does not, because the strength of the link is itself a modeling decision. Original diagram by the authors, created with AI assistance.
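The difference in link strength discussed in this section can be made concrete with a toy model. The sketch below is not how a real OWL reasoner is implemented; it only assumes that an identity link makes both names carry every fact, while a match link is recorded and moves nothing. The property names and the `fuse_same_as` helper are illustrative inventions.

```python
from collections import defaultdict

# Toy facts keyed by identifier; the property names are illustrative only.
facts = {
    "lims:DS-2026-001": {("hasStatus", "released")},
    "bp:DS-001": {("monomerPct", "98.611")},
    "erp:material-1000457": {("storedIn", "warehouse-A")},
}


def fuse_same_as(facts, same_as_pairs):
    """Treat each pair as true identity: both names end up with every fact."""
    parent = {name: name for name in facts}

    def find(name):
        while parent[name] != name:
            name = parent[name]
        return name

    for a, b in same_as_pairs:
        parent[find(a)] = find(b)

    merged = defaultdict(set)
    for name, props in facts.items():
        merged[find(name)] |= props
    return {name: set(merged[find(name)]) for name in facts}


fused = fuse_same_as(facts, [("lims:DS-2026-001", "bp:DS-001")])

# A match link is only recorded; it copies no facts between the two records.
matches = [("erp:material-1000457", "skos:exactMatch", "bp:DS-001")]

print(sorted(fused["bp:DS-001"]))            # carries the LIMS fact as well
print(sorted(facts["erp:material-1000457"])) # untouched by the match
```

If the identity pair were wrong, every fact on one record would now also be asserted of the other, in both directions; a wrong entry in `matches` would leave both fact sets as they were.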

The typed value: a number is text until you say otherwise

Identity handles the subjects and predicates. The object, when it is a value rather than an edge, needs its own discipline, because a bare 98.611 is not yet a number to a machine; it is a string of characters that looks like one. RDF fixes this with the typed literal: a lexical form (the written-out text of the value, 98.611) paired with a datatype IRI (xsd:float, a 32-bit floating-point number that can hold fractional values) that tells any reader to parse it as a number, not as text, a date, or a code [2]. Get the datatype wrong and arithmetic silently breaks: "98.611" as a string sorts and compares character by character, so "100.0" is "less than" "98.611" because "1" sorts before "9". The datatype is the difference between a value you can compute on and one you can only display.
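A short illustration of why the datatype matters, using three made-up readings: the same lexical forms sort one way as text and another way once they are parsed as numbers.

```python
readings = ["98.611", "100.0", "9.5"]

as_text = sorted(readings)                # compares character by character
as_numbers = sorted(readings, key=float)  # parses each lexical form first

print(as_text)     # ['100.0', '9.5', '98.611']
print(as_numbers)  # ['9.5', '98.611', '100.0']
```

Nothing raises an error in the text case, which is what makes an untyped value dangerous: the ordering is simply wrong.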

But a typed number is still ambiguous about what it measures. 98.611 what? The data book's anatomy of a triple flagged this: the historian logged Monomer% = 98.6 with the percent buried in a label, while a CSV wrote %Mono,98.611,frac where "frac" might mean a fraction. Same number, three incompatible meanings. The cure is QUDT — the Quantities, Units, Dimensions and Types vocabulary — which lets the value carry its unit as a machine-readable IRI rather than a string suffix [3]. QUDT separates two things people conflate: the quantity kind (what sort of quantity this is — a dimensionless ratio, a temperature, a mass concentration) and the unit (unit:PERCENT, unit:DEG_C, unit:GM-PER-L). Pinning both means 98.611 unambiguously denotes 98.611 % of a dimensionless ratio, and the campaign's culture-temperature setpoint of 36.5 is unit:DEG_C with quantity kind Temperature — never to be misread as Fahrenheit or Kelvin. For interchange with clinical and laboratory systems the same role is played by UCUM, the Unified Code for Units of Measure, whose case-sensitive codes (%, Cel, g/L) are designed to be unambiguous across software [4]. The dataset carries that setpoint as its own qualified value, bp:RPS-temp-CCP001-qv, with the UCUM code Cel recorded alongside the QUDT unit, because the historian and the OPC UA server (the industrial communication protocol that streams sensor data off the plant floor) both speak UCUM on the wire — that is, in the messages they transmit:

# instances.ttl — the 36.5 degC setpoint, qualified once and readable by both worlds.
bp:RPS-temp-CCP001-qv a qudt:QuantityValue ;
    qudt:numericValue "36.5"^^xsd:float ; qudt:hasUnit unit:DEG_C ;
    qudt:hasQuantityKind qkind:Temperature ; qudt:ucumCode "Cel" .   # 'Cel' is the wire string the historian logs
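One way a loader might bridge the two unit dialects is a small lookup from UCUM code to QUDT unit. The sketch below uses only the three pairings this chapter names; the table and function names are hypothetical, and a real deployment would draw the mapping from QUDT's own data rather than a hand-typed dict.

```python
# Pairings named in this chapter: UCUM code -> QUDT unit (prefixed form).
UCUM_TO_QUDT = {
    "%": "unit:PERCENT",
    "Cel": "unit:DEG_C",
    "g/L": "unit:GM-PER-L",
}


def qudt_unit_for(ucum_code: str) -> str:
    """Look up the QUDT unit for a UCUM code. UCUM codes are case-sensitive."""
    try:
        return UCUM_TO_QUDT[ucum_code]
    except KeyError:
        raise ValueError(
            f"no QUDT mapping recorded for UCUM code {ucum_code!r}"
        ) from None


print(qudt_unit_for("Cel"))  # unit:DEG_C
```

The lookup is deliberately exact: "cel" or "CEL" is refused rather than normalized, because UCUM treats case as significant.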

A value that carries its unit as data is a future deviation that never happens. Here is the running example's monomer value, fully qualified, exactly as the loadable dataset carries it:

# instances.ttl — 98.611 made self-describing: a number, a unit IRI, and a quantity kind.
bp:DS-001-monomer a qudt:QuantityValue ; rdfs:label "DS-001 SEC %monomer" ;
    qudt:numericValue "98.611"^^xsd:float ;         # the number, typed as a float
    qudt:hasUnit unit:PERCENT ;                     # the unit as an IRI, not a string suffix
    qudt:hasQuantityKind qkind:DimensionlessRatio . # what *sort* of quantity it is

(QUDT's current property is qudt:hasUnit; the older qudt:unit is deprecated.) Now 98.611 can be read, compared, and converted by any system without guessing — the bare rumor has become a fact.

How an identifier and a unit are actually assigned. The two disciplines of this chapter are not folklore applied by feel; each value above was minted by the same short, reproducible procedure. Worked end to end on bp:DS-001-monomer — the released lot's SEC monomer reading the whole book leans on — it runs:

Decide whether the thing is reused or local, because that decides the namespace. The host organism is not ours to name — it is the Chinese hamster, which the NCBI Taxonomy (a public catalogue of organisms) already identifies — so the model reuses that term under the OBO PURL (a persistent, redirecting web identifier for an open biomedical ontology term), obo:NCBITaxon_10029 (verified to resolve via OLS4, the EBI Ontology Lookup Service the Reuse part insists on). A drug-substance lot, by contrast, exists only in this campaign, so it mints locally under bp: — bp:DS-001. The rule: reuse a global, persistent IRI for anything the world already names; mint a local bp: IRI only for what is genuinely yours.
Mint the local id to be stable, not meaningful. DS-001 identifies the lot without encoding its date, plant, or status — facts that change and would rot a meaning-bearing key — and it is never reused once retired (CQ-15's discipline).
Never let a number land bare — attach a unit through QUDT. The reading is 98.611, so it becomes a qudt:QuantityValue: the magnitude typed ("98.611"^^xsd:float), the unit pinned as an IRI (qudt:hasUnit unit:PERCENT), and the quantity kind pinned separately (qudt:hasQuantityKind qkind:DimensionlessRatio) so a reader knows a percent is a scaled ratio, not a dimension of its own.
Justify the scheme against persistence and round-tripping. The reused term sits behind a PURL that survives a host move; the local id would, in a real deployment, sit behind a PURL too (our example.org/bioproc# is the host-coupled teaching anti-pattern the closing section names); and for a value that must survive a GxP audit unchanged (GxP being the umbrella of Good Practice regulations — GMP and the like — under which pharmaceutical records are kept), the xsd:float would become xsd:decimal. Each choice is recorded, not assumed.
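The procedure above can be sketched as a single helper that refuses to emit a value without its unit and quantity kind. This is an illustrative string builder, not the companion's actual loader; the function name and its parameters are assumptions, and the magnitude is passed as text so that its lexical form is preserved.

```python
def quantity_value_turtle(subject, magnitude, unit, quantity_kind,
                          datatype="xsd:float"):
    """Render one qudt:QuantityValue as Turtle; refuse a bare number."""
    if not unit:
        raise ValueError("a value never travels bare: a unit is required")
    if not quantity_kind:
        raise ValueError("a quantity kind is required")
    return (
        f"{subject} a qudt:QuantityValue ;\n"
        f'    qudt:numericValue "{magnitude}"^^{datatype} ;\n'
        f"    qudt:hasUnit {unit} ;\n"
        f"    qudt:hasQuantityKind {quantity_kind} .\n"
    )


print(quantity_value_turtle(
    "bp:DS-001-monomer", "98.611", "unit:PERCENT", "qkind:DimensionlessRatio"
))
```

Passing `datatype="xsd:decimal"` would produce the form the last step recommends for a value that must survive a GxP audit unchanged.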

The monomer value carries this worked example for the same reason it carries the release gate: it is the one number on which the campaign's golden lot (the single worked-through manufacturing run this book follows end to end) is dispositioned, that is, accepted or rejected for release. It sits in spec (within its specification limit) at 98.611 % against its Spec-DS-mAb-A monomer limit, where the SEC-monomer criterion (monomer purity measured by size-exclusion chromatography) is at least 95.0 %. Getting its identity and its unit exactly right is therefore the case where the discipline pays the most, and where a silent misread would do the most damage.

One pairing in that block repays a second look: the unit is unit:PERCENT, but the quantity kind is qkind:DimensionlessRatio, not a "Percent" kind. That is deliberate — a percentage is a dimensionless ratio scaled by one-hundredth, so the kind answers "what sort of quantity is this?" (a pure ratio) while the unit answers "in what scale is it written?" (parts per hundred). The same monomer fraction expressed as 0.98611 carries the identical quantity kind and a different unit; pinning the kind, not just the unit, is what lets a reasoner know the two are the same quantity in different clothes.

Two further facets repay a second look. First, the magnitude is typed xsd:float for brevity, but xsd:float is binary IEEE-754 and cannot store 98.611 exactly; for a number that must round-trip through a GxP record unchanged, xsd:decimal is the integrity-preserving datatype — the datatype choice is itself a data-integrity decision, not a formatting one. Second, ontologically a qudt:QuantityValue is not the quality. The monomer-purity quality inheres in the drug substance — it exists only in that one lot of material and nowhere else (in the upper-ontology vocabulary of BFO and IOF, a specifically dependent continuant). The QuantityValue, by contrast, is the information artifact that records that quality's magnitude — a fact you can copy, print, or email without moving the purity itself (a generically dependent continuant in the IOF/IAO sense). Keeping them distinct — the purity is in the material, the number is a record about it — is why the same 98.611 can be carried both as a convenience scalar (bp:monomerPct) and as the fully-qualified bp:DS-001-monomer without the model claiming the number is the purity.
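The round-trip point about xsd:float can be checked directly. The sketch below emulates a 32-bit IEEE-754 float by packing the value into four bytes, and uses Python's `Decimal` as a stand-in for xsd:decimal; both are illustrations of the datatypes' behavior, not of any particular triple store.

```python
import struct
from decimal import Decimal

lexical = "98.611"

# xsd:float is a 32-bit IEEE-754 value; emulate it by packing into 4 bytes.
as_float32 = struct.unpack("<f", struct.pack("<f", float(lexical)))[0]

# Decimal keeps the written digits, as xsd:decimal is meant to.
as_decimal = Decimal(lexical)

print(repr(as_float32))  # close to, but not exactly, 98.611
print(str(as_decimal))   # 98.611
```

The float error is tiny, but the value written back is no longer the value that was recorded, which is the integrity concern rather than a precision one.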

One subtlety hides inside that tidy unit:PERCENT: a percent does not say percent of what. The running example's 98.611 is an area percent — the monomer peak's integrated UV (A280) area as a fraction of the total integrated area on the SEC chromatogram — not a mass, mole, or volume fraction. (Size-exclusion chromatography and how its peak areas become a purity number are the subject of the manufacturing book's analytical and formulation chapter.) A buffer's ethanol content might be % v/v; a surfactant % w/v; a charge-variant main peak, like the monomer, area %. All render as the same unit:PERCENT IRI, so pinning the unit is necessary but not sufficient: the basis of the ratio belongs in the quantity kind or the label too. This is the very ambiguity this chapter set out to kill, surviving one level down — which is why a careful graph records rdfs:label "SEC %monomer (area, A280)" rather than a bare "%monomer."
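One way to keep the basis from getting lost is to carry it as a required field next to the magnitude and refuse arithmetic across bases. The sketch below is a hypothetical in-memory representation, not the companion's model; it assumes the 95.0 % monomer limit is expressed on the same area basis as the reading.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class PercentReading:
    value: Decimal
    basis: str  # e.g. "area", "w/v", "v/v"; recorded, never inferred


def difference(a: PercentReading, b: PercentReading) -> Decimal:
    """Subtract two percent readings only when they share a basis."""
    if a.basis != b.basis:
        raise ValueError(f"cannot compare {a.basis} % with {b.basis} %")
    return a.value - b.value


monomer = PercentReading(Decimal("98.611"), "area")
limit = PercentReading(Decimal("95.0"), "area")  # assumed to be area % too

print(difference(monomer, limit))  # 3.611
```

Comparing the monomer reading with a "w/v" percent raises an error instead of returning a number that looks plausible and means nothing.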

The dataset keeps both forms of this reading on the lot: a bare convenience scalar bp:monomerPct "98.611" (instances.ttl:372) for the quick lineage and release queries that just need the number, and the fully-qualified bp:DS-001-monomer QuantityValue above (linked via bp:monomerValue) for anything that must convert, compare across units, or be exported to a system that speaks a different unit dialect. Carrying both is a pragmatic compromise, not a contradiction — the scalar is fast to write and query, the QuantityValue is safe to ship.

Percent is the easy unit to picture, but the values that actually collide on a plant floor are the ones that share a dimension and differ only by a prefix or a basis. The clarified-harvest titer arrives as 4.8 g/L; the drug-substance protein concentration is specified 45-55 mg/mL. The two share a dimension (mass per volume) and, because the milli- prefixes cancel, a scale: 1 g/L equals 1 mg/mL. The trap is a loader that sees two different prefixes and "converts" by 1000 anyway, or a column that is really mg/L read as g/L, and either slip is a silent 1000x error no datatype catches. HCP is reported 12 ppm (equivalently, ng of host-cell protein per mg of product), a dimensionless ratio whose basis (per total protein) must travel with it. Turbidity is 3.2 NTU, a unit with no clean QUDT IRI at all (the dataset's ucumCode "[NTU]" is frankly a placeholder). Each of these is exactly why the unit must be data, not a column header, and why the loaders (the ingest jobs that read each source system) normalize an OPC UA EngineeringUnits code or a historian unit string onto one QUDT/UCUM target before the value ever enters the graph. (What these physical quantities are, namely the titer, the clarified harvest, host-cell protein, and turbidity, is the subject of the manufacturing book's harvest and clarification chapter.)
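A minimal sketch of the normalization step for mass concentrations, assuming the loader targets g/L and looks every conversion up in an explicit table. The table and function names are hypothetical; the factors themselves are plain arithmetic on SI prefixes.

```python
from decimal import Decimal

# Multiply by this factor to reach g/L. The milli- prefixes in mg/mL cancel,
# so its factor is 1; mg/L is the one that is a thousand times smaller.
TO_GRAMS_PER_LITRE = {
    "g/L": Decimal("1"),
    "mg/mL": Decimal("1"),
    "mg/L": Decimal("0.001"),
}


def to_g_per_l(value: str, ucum_code: str) -> Decimal:
    """Convert a mass concentration to g/L, refusing unknown unit strings."""
    if ucum_code not in TO_GRAMS_PER_LITRE:
        raise ValueError(f"no conversion recorded for unit {ucum_code!r}")
    return Decimal(value) * TO_GRAMS_PER_LITRE[ucum_code]


print(to_g_per_l("4.8", "g/L"))     # 4.8
print(to_g_per_l("50", "mg/mL"))    # 50
print(to_g_per_l("4800", "mg/L"))   # 4.800
```

Because the factor is looked up rather than assumed, a reading with a missing or unrecognized unit stops at the loader instead of entering the graph as a bare number.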

The same number, before and after: a bare 98.611 is text of unknown kind and unit; layered with a datatype, a QUDT unit, and a quantity kind, it becomes a self-describing fact a machine can compute on and never misread. Original diagram by the authors, created with AI assistance.

Why this is exactly what makes data FAIR

The two disciplines of this chapter map directly onto two of FAIR's four letters, which is why they are foundations and not finishing touches [1]. Global, persistent IRIs are what make data Findable (F1) and addressable across systems — you cannot find what has no stable name. Shared vocabularies plus qualified, unit-bearing, datatyped values are what make data Interoperable (I1–I3) — the "I" that the data book showed is the one real datasets most often fail, because their metadata used bare strings and local keys. The reconciliation-and-normalize pipeline the data book sketched — map every name to one shared predicate, pin every value's unit and datatype, then load — is, step by step, the identity and unit discipline of this chapter applied at scale.

This is not a slogan the book leaves unchecked. The companion makes "a value never travels bare" an executable acceptance test, CQ-19: "Does every stored quantity value carry a unit (a QUDT unit IRI or a UCUM code) — no bare numbers?" Its SPARQL query (SPARQL is the query language for RDF graphs; SELECT ?qv asks for every value ?qv matching the pattern below) asks for the offenders, and the model is correct only when there are none —

# CQ-19 — returns every quantity value missing BOTH a unit IRI and a UCUM code. PASS = zero rows.
PREFIX qudt: <http://qudt.org/schema/qudt/>
SELECT ?qv WHERE {
  ?qv a qudt:QuantityValue ; qudt:numericValue ?n .
  FILTER NOT EXISTS { ?qv qudt:hasUnit  ?u . }
  FILTER NOT EXISTS { ?qv qudt:ucumCode ?c . }
}

The OR is deliberate: a value is acceptable if it carries a unit IRI or a UCUM code, so the query must flag only values that carry neither — which is why it uses two FILTER NOT EXISTS clauses (each rules out one of the two), together meaning exactly "has no unit IRI and no UCUM code." Turbidity makes the point: it is logged as [NTU], a unit QUDT has no clean IRI for, so its value carries a UCUM code instead of qudt:hasUnit — and the test still passes because a magnitude with a UCUM code is not bare. Interoperability you can run as a test is the difference between a FAIR claim and FAIR data.
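The same logic can be restated over plain records to make the "neither, not either" condition easy to see. The sketch below is an illustration of the test's intent, not the companion's test harness; the turbidity and bare-value identifiers are hypothetical.

```python
quantity_values = [
    {"id": "bp:DS-001-monomer", "numericValue": "98.611",
     "hasUnit": "unit:PERCENT"},
    {"id": "bp:turbidity-example", "numericValue": "3.2",
     "ucumCode": "[NTU]"},                                # UCUM code only
    {"id": "bp:bare-example", "numericValue": "7.0"},     # hypothetical offender
]


def bare_values(records):
    """Return ids of values that carry neither a unit IRI nor a UCUM code."""
    return [
        r["id"]
        for r in records
        if not r.get("hasUnit") and not r.get("ucumCode")
    ]


offenders = bare_values(quantity_values)
print(offenders)  # ['bp:bare-example']
```

The test passes when `offenders` is empty; the turbidity record is not flagged because one of the two unit carriers is present.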

Why a model downstream inherits these two guarantees

The two disciplines look like data hygiene, but they are also the upstream safeguard for everything an inference engine or a learned model does later — and the cost of skipping them does not stay local; it lands, amplified, in the answers a system gives a human. The companion volume on machine learning makes the dependency explicit in the other direction: a model is only ever as trustworthy as the structured, governed knowledge it is anchored to, and that knowledge is exactly the identified, unit-bearing graph this chapter builds. Three concrete couplings are worth naming.

First, unit confusion is a data-leakage path. The g/L-versus-mg/mL prefix trap above, where a spurious or dropped factor of 1000 slips in, is not only a unit error; fed into training data, it is a feature that means different things in different rows, and a learner cannot tell the difference between two genuine populations and one population logged in two unit dialects. The same goes for a percent whose basis (area vs mass vs recovery) silently changes between sources. A qudt:QuantityValue that pins the magnitude, the unit IRI, and the quantity kind is what lets the loader normalize every source onto one commensurable feature before a model ever sees it: the reconcile-then-normalize pipeline the data book describes, read here as feature engineering's quietest prerequisite. The ML book's models-and-validation chapter names this directly: a soft sensor's most common silent failure is a feature that is not the same quantity across batches.

Second, identity is what makes leave-one-batch-out validation possible at all. Honest cross-validation over instances requires grouping every row by the physical batch it came from and holding whole batches out — a model scored on rows from a batch it also trained on reports a flattering number it will never reproduce in the plant. That grouping is impossible unless the bp:BATCH-2026-001 identity is global and stable across the historian, the LIMS, and the release record — the very discipline this chapter installs. A graph in which one batch is named four ways, or two batches are silently fused by an over-eager owl:sameAs, breaks the group boundary and leaks the held-out set into training. Identity discipline and grouped cross-validation are the same guarantee seen from two ends.

Third, the graph is the ground truth a model is checked against, not a fact it learns. When a retrieval-augmented language model answers "what was this lot derived from?" it should traverse the typed bp:derivedFrom edges and cite them — the GraphRAG pattern this book's own frontier chapter builds — rather than narrate a plausible lineage from training memory. That only works if the edges are real and the nodes are globally identified; a fluent model will compose a confident wrong answer as smoothly as a right one, and only the identified, validated graph can tell the two apart. The release gate's SHACL shapes earn a second life here: the same constraint that refuses a unit-less release refuses a unit-less or incompletely-typed retrieval, certifying a subgraph is well-formed before a model is handed it — the validation asymmetry the ML book frames as a reasoned graph constraining a guessing model.
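The grouped hold-out described under the second coupling above can be sketched as follows. The alias table stands in for the output of a curated reconciliation step; the second batch, the feature values, and all function names are hypothetical and exist only to show that rows are grouped by global identity rather than by source-system name.

```python
from collections import defaultdict

# Hypothetical alias table: source-system name -> one global batch IRI.
CANONICAL = {
    "DS-2026-001": "bp:BATCH-2026-001",
    "Lot 26-001": "bp:BATCH-2026-001",
    "Lot 26-002": "bp:BATCH-2026-002",
}

# (source-system name, feature value) rows with made-up feature values.
rows = [("DS-2026-001", 0.91), ("Lot 26-001", 0.93), ("Lot 26-002", 0.88)]


def leave_one_batch_out(rows):
    """Yield (batch, train_rows, test_rows), holding out one whole batch."""
    by_batch = defaultdict(list)
    for local_name, feature in rows:
        by_batch[CANONICAL[local_name]].append((local_name, feature))
    for held_out in sorted(by_batch):
        test = by_batch[held_out]
        train = [
            row
            for batch, group in by_batch.items()
            if batch != held_out
            for row in group
        ]
        yield held_out, train, test


for batch, train, test in leave_one_batch_out(rows):
    print(batch, len(train), len(test))
```

Grouping on the raw names instead would treat DS-2026-001 and "Lot 26-001" as different batches and place one in training while the other is held out, which is exactly the leak the paragraph warns about.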

Identity then units, in one pass: reconcile every name onto a global IRI, normalize and type every value with its QUDT unit, and the loaded triple is Findable and Interoperable by construction, which is the FAIR "I" most real datasets miss. Original diagram by the authors, created with AI assistance.

The unsolved part: identity reconciliation is still partly manual

Units are mostly the solved half of this chapter — with one honest asterisk. QUDT and UCUM remove dimensional ambiguity completely: unit:DEG_C can never be misread as Kelvin, unit:GM-PER-L never as unit:MilliGM-PER-MilliL even though both are mass over volume. The asterisk is percent and basis. unit:PERCENT and UCUM % pin the magnitude but not whether the ratio is w/w, v/v, mol %, area %, % recovery, or % CV — and in a mAb plant those genuinely differ: SEC monomer and CEX main peak are area %, an excipient is % w/v, a yield is % recovery. The cure is not a better unit IRI but discipline in the quantity kind and label.

A second residual case is the log-reduction value (LRV — a measure of how far a purification step lowers virus levels; an LRV of 4.5 means the virus count drops by a factor of 10⁴·⁵). The viral-clearance LRVs the companion carries (4.5 for the low-pH hold, 4.2 for nanofiltration) are base-10 logs of a titer ratio with no clean QUDT unit, so they travel as bare typed floats with the semantics carried by the property (bp:lrvValue) instead of a unit. Because they are logarithms, two independent (orthogonal) steps' values add — 4.5 plus 4.2 gives 8.7 total clearance — which is exactly the arithmetic a unit alone could never convey. (These viral-clearance steps are covered in the manufacturing book's viral inactivation and viral filtration chapters.) So even units are not uniformly solved — they are solved where a dimension exists to pin, and merely disciplined where the quantity is a basis-dependent ratio or a log.
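The additivity of log-reduction values is ordinary logarithm arithmetic, and a few lines make it checkable. The two LRVs are the ones quoted above; the variable names are illustrative.

```python
import math

lrv_low_ph = 4.5
lrv_nanofiltration = 4.2

# Adding the logs is the same as multiplying the fold reductions.
total_lrv = lrv_low_ph + lrv_nanofiltration
fold_reduction = 10 ** lrv_low_ph * 10 ** lrv_nanofiltration

print(round(total_lrv, 1))  # 8.7
print(math.isclose(math.log10(fold_reduction), total_lrv))  # True
```

Nothing in a unit IRI tells a consumer that these values combine by addition; that knowledge lives in the property that carries them.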

Identity reconciliation is the unsolved half. Deciding that the DCS's BR101.Temp.PV, the LIMS's DS-2026-001, the ERP's 1000457, and a PDF's "Lot 26-001" all refer to the same batch is an entity-resolution problem, and across decades-old legacy systems with no shared key it remains stubbornly manual or semi-automated, error-prone, and expensive. Get one match wrong and owl:sameAs does not fail small: it raises no error, fuses two distinct batches into one, and propagates every false fact in both directions, the contamination the sameAs literature warned about [5]. No reasoner can tell you that two correctly-typed IRIs denote the same physical thing; that judgment comes from process knowledge a human supplies.

A second open problem is quieter and longer-term: persistence. FAIR demands identifiers that stay resolvable, but a bp: IRI is only as durable as the organization and namespace behind it. Who guarantees that https://example.org/bioproc#BATCH-2026-001 still resolves in fifteen years, the horizon a GxP record may have to survive? PID governance — handles, PURLs, registries, and the institutions that keep them alive — is a real and underfunded discipline, and an identifier that dies is no better than the local key it replaced. So the standard this chapter sets is sober: units you can make unambiguous today; identity and its persistence you must actively govern, and the field has not handed you an automatic answer.

The fix the field converged on is to decouple the identifier from its host. A production ontology does not mint IRIs under a server it owns; it mints them under a redirection service — a PURL (purl.obolibrary.org/obo/… for OBO terms), a w3id.org permanent URL, or a Handle/DOI — that resolves to wherever the artifact currently lives. The host can move, the company can be acquired, the URL scheme can change; the identifier does not, because the redirect, not the web server, is the thing under governance. Our https://example.org/bioproc# namespace is, frankly, the anti-pattern — a host-coupled, non-resolving example IRI — and a real deployment would put bioproc behind a PURL the moment it left the teaching repository. How a graph is published — under what persistent identifier, in what registry, with what commitment to keep resolving — is itself a chapter: Publication and FAIR takes up exactly this question of durable, governed identifiers for the ontology and its instance data.
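The decoupling described above amounts to one level of indirection, which a toy redirect table can illustrate. Every URL below is hypothetical, including the w3id.org path, and the dict merely stands in for whatever a real redirection service stores.

```python
# Hypothetical redirect table kept by a PURL-style service:
# persistent identifier -> wherever the artifact currently lives.
redirects = {
    "https://w3id.org/example-bioproc": "https://old-host.example.org/bioproc.ttl",
}


def resolve(persistent_iri: str) -> str:
    """Return the location currently registered for a persistent identifier."""
    return redirects[persistent_iri]


pid = "https://w3id.org/example-bioproc"
before = resolve(pid)

# The host moves: only the redirect entry changes, never the identifier.
redirects[pid] = "https://new-host.example.com/ontologies/bioproc.ttl"
after = resolve(pid)

print(pid, before, after, sep="\n")
```

Every triple that cites `pid` stays valid across the move, because the only thing edited is the table under governance.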

Why it matters

Every edge the rest of this book draws and every value it hangs depends on the two guarantees made here. If identity is shaky, a lineage walk can silently cross between two different batches and an investigation reaches a wrong conclusion with full confidence. If units are bare, a number means different things in different systems and a control decision is made on a misread value. The whole edifice of classes, relations, and axioms rests on names that are global and values that are self-describing; this is the bedrock under all of it, which is why it closes the foundations before we start modeling the process itself.

These guarantees do not have to stay promises. Two chapters on, the release gate turns "a value never travels bare" into an enforced rule: a SHACL shape (SHACL is the constraint language that validates a graph against rules) rejects any quantity value that arrives without a unit, so the discipline this chapter argues for becomes a gate a non-conforming triple cannot pass. Identity has no such mechanical guard — which is exactly why reconciliation is the unsolved half.

In the real world

These are not bespoke ideas. QUDT is a published, widely used units ontology, and UCUM is the unit standard embedded in clinical and laboratory data exchange worldwide [3][4]. Persistent-identifier systems (DOIs for publications, Handles and PURLs for data, GS1 keys for traded goods) already run at global scale, and the serialization chapter shows GS1 identifiers doing exactly this job on a finished vial. On the reconciliation side, the commercial "data fabric" and "master data management" products that plants buy are, underneath the branding, entity-resolution engines wrestling with precisely the identity problem this chapter says is unsolved, which is why they remain expensive, human-supervised, and never quite finished.

Key terms

IRI (Internationalized Resource Identifier) — the globally unique web name RDF gives every subject, predicate, and resource object; unlike a local primary key it means the same thing across systems and sites.
Namespace / minting — the prefix that makes a local id global, and the discipline of creating stable, non-reused identifiers.
Reasoner — inference software that automatically derives new, logically-entailed facts from the triples you assert; this is precisely why an over-eager owl:sameAs propagates so widely.
owl:sameAs — the strong assertion that two IRIs denote the identical individual, fusing all their facts; powerful and easily misused (the "sameAs problem").
skos:exactMatch / closeMatch — softer links that record two terms line up without claiming entity identity, so a bad match degrades a mapping rather than corrupting the graph.
PROV-based reconciliation — instead of fusing IRIs with owl:sameAs, recording each system's assertion as a prov:Entity attributed to its source agent and resolving the conflict with a steward prov:Activity; keeps the audit trail and avoids the over-merge (the pattern CQ-15 tests).
Typed literal / datatype — a value paired with a datatype IRI (xsd:float) so a machine parses it as a number, date, or code rather than as text.
QUDT — the Quantities, Units, Dimensions and Types vocabulary; lets a value carry its unit and quantity kind as machine-readable IRIs rather than a string suffix.
UCUM — the Unified Code for Units of Measure; case-sensitive unit codes designed to be unambiguous across software, used widely in lab and clinical exchange.
Quantity kind vs unit — what sort of quantity a value is (temperature, mass concentration, dimensionless ratio) versus the specific unit it is expressed in.
Entity resolution / reconciliation — deciding that differently-named records refer to the same real thing; across legacy systems, still largely manual and the unsolved half of identity.
FAIR F1 / I — the principles that data carry a globally unique persistent identifier (Findable) and use shared vocabularies with qualified, unit-bearing values (Interoperable).
Leave-one-batch-out / grouped cross-validation — scoring a model only on whole physical batches held out of training; impossible unless batch identity is global and stable, so identity discipline is also a leak-prevention guarantee for downstream models.
Ground truth / graph constraint — the identified, unit-bearing, validated graph an inference engine or a retrieval-augmented model is checked against rather than learns; SHACL shapes certify a subgraph is well-typed before a model is handed it.

Where this leads

With identifiers and units in place, the model's Formalization is complete: the upper spine borrowed in Reuse, the classes and relations of Conceptualization, and now the axioms and the identity-and-unit discipline that keeps names global and values self-describing. We are ready to point the whole kit at the process and fill it. The next chapter, Implementation: Building the Instance Graph, opens Part V: it instantiates the one antibody campaign — a frozen vial of engineered CHO cells expanded, harvested, captured, polished, and filled — as real individuals in a loadable file, then asks the lineage and release questions a manufacturer actually asks.
