Identifiers and Units: IRIs, QUDT, and the Typed Value
📍 Where we are: Part I · Foundations of the Model — Chapter 3. We can author classes, relations, and axioms. This chapter examines the two things every one of those quietly assumed: that a name means the same thing everywhere, and that a number never travels without its unit.
The last two chapters built a vocabulary and gave it logic. But every triple we wrote leaned on two unstated promises. When we said PApool-001 derivedFrom CLAR-001 — and through the chain, that the capture pool traces back to BATCH-2026-001 — we assumed each name picks out one thing, the same one to every system that reads it. And when we said monomerPct 98.611, we assumed 98.611 carries enough with it to be unambiguous. Neither promise holds for free — and the gap between them and reality is exactly the swamp the semantic-interoperability chapter named. This chapter closes both gaps: the identifier that makes a name global, and the unit-and-datatype discipline that makes a value self-describing.
Two small habits separate a number you can trust from one you cannot. The first is a passport: instead of calling something "BR-101" — a name that means one thing in your plant and something else in mine — you give it a globally unique identifier that cannot be confused with anyone else's, anywhere. The second is always writing the currency: "5" is useless; "5 USD" you can bank. A value of 98.611 is a rumor; 98.611, typed as a number and tagged with the unit percent, is a fact. This chapter is about giving every name a passport and every number its currency.
What this chapter covers
We unpack the IRI — why a global identifier beats a local primary key — and the discipline of minting and reconciling them, including how owl:sameAs links one real batch named four different ways and why that link is more dangerous than it looks. We then give values their identity: the datatype that makes 98.611 a number rather than text, and QUDT units that pin what the number measures. We dissect one fully-identified value, show how this is precisely what makes data Findable and Interoperable, and close on the genuinely unsolved problem of identity reconciliation across legacy systems.
The IRI: a global name, not a local key
In a relational database, BATCH-2026-001 is a primary key: unique within that table, meaningless outside it. The moment the lab system, the ERP, and the historian each hold their own "BATCH-2026-001," you have three keys that look identical and may not mean the same batch — or one batch with four different keys. RDF's answer is the IRI, an Internationalized Resource Identifier: a globally unique web name, built from a namespace prefix and a local id, that resolves to one thing and one thing only [2]. Our bp:BATCH-2026-001 expands to https://example.org/bioproc#BATCH-2026-001, which cannot collide with another organization's batch of the same local name because the namespace makes it global. The subject, the predicate, and the resource-valued object of every triple are IRIs — so identity is built into the data model, not bolted on.
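As a minimal sketch of the expansion just described, the following shows how a prefix plus a local id becomes a full IRI, and why two organizations using the same local name do not collide. The `bp` namespace is the chapter's own; the `other` prefix and its namespace are hypothetical, introduced only for contrast, and `expand` is an illustrative helper rather than any library's API.

```python
# Prefix table. "bp" is the chapter's namespace; "other" is a HYPOTHETICAL
# second organization, added only to show that local names cannot collide.
PREFIXES = {
    "bp": "https://example.org/bioproc#",
    "other": "https://example.com/plant-b#",  # hypothetical namespace
}


def expand(prefixed_name: str) -> str:
    """Expand a prefixed name such as 'bp:BATCH-2026-001' into a full IRI."""
    prefix, sep, local = prefixed_name.partition(":")
    if not sep or prefix not in PREFIXES:
        raise ValueError(f"unknown or missing prefix in {prefixed_name!r}")
    return PREFIXES[prefix] + local


ours = expand("bp:BATCH-2026-001")
theirs = expand("other:BATCH-2026-001")

print(ours)            # the full, global name
print(ours == theirs)  # same local id, different namespaces
```

The local id is identical in both calls; only the namespace differs, and that alone is enough to keep the two names distinct.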
This is the structural form of the FAIR principles' very first letter: F1, "data and metadata are assigned a globally unique and persistent identifier" [1]. A local key cannot be findable across systems; a persistent IRI can. Minting them well is its own small discipline — prefer opaque, stable ids over meaningful ones that rot when the meaning changes, never reuse a retired identifier, and decide up front who guarantees the namespace keeps resolving, because in a regulated setting that identity may need to resolve for the decades a record must be retained.
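The minting rules above can be sketched as a toy registry. This is an illustration under stated assumptions, not a governance system: the `IdentifierRegistry` class, its method names, and the `id-` local-name pattern are all hypothetical, and a real registry would be a persistent, governed service rather than two in-memory sets.

```python
import uuid

NAMESPACE = "https://example.org/bioproc#"  # the chapter's bp: namespace


class IdentifierRegistry:
    """Toy in-memory registry (hypothetical); illustrates the rules only."""

    def __init__(self) -> None:
        self._active: set[str] = set()
        self._retired: set[str] = set()

    def mint(self) -> str:
        """Mint an opaque IRI: the local id carries no meaning that can rot."""
        iri = f"{NAMESPACE}id-{uuid.uuid4().hex}"
        self._active.add(iri)
        return iri

    def retire(self, iri: str) -> None:
        """Retire an active identifier; it stays on record so it is never reused."""
        self._active.remove(iri)  # raises KeyError if the IRI was not active
        self._retired.add(iri)

    def register(self, iri: str) -> None:
        """Register a proposed IRI, refusing retired or already-active ones."""
        if iri in self._retired:
            raise ValueError(f"{iri} was retired and must never be reused")
        if iri in self._active:
            raise ValueError(f"{iri} is already in use")
        self._active.add(iri)
```

Keeping the retired set separate from the active set is the design point: retirement removes an identifier from use without ever freeing it for reassignment.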
One batch, named four ways: reconciliation and its hazard
The hard case is not naming a new thing; it is discovering that four systems have already named the same thing differently. The bioreactor DCS logs BR101.Temp.PV; the LIMS calls the lot DS-2026-001; the ERP calls the material 1000457; a certificate of analysis PDF says "Lot 26-001." Each is internally consistent and mutually unintelligible — the precise problem the open-source knowledge-graph chapter built its graph to solve. The graph is where a model that reconciles them can finally live.
The blunt tool is owl:sameAs, which asserts that two IRIs denote the identical individual, so everything true of one is true of the other. Write lims:DS-2026-001 owl:sameAs bp:DS-001 and a reasoner fuses their facts. It is powerful and it is a loaded gun: owl:sameAs propagates every property in both directions, so a single over-eager identity link can contaminate a graph with false inferences. This is the well-documented "sameAs problem," where Linked Data practitioners found the relation routinely misused to mean "similar to" or "related to" rather than true identity [5]. The safer instruments for looser links are weaker on purpose: skos:exactMatch and skos:closeMatch record that two terms line up without claiming they are the same entity, so a bad match degrades a mapping instead of corrupting the whole graph. Choosing the right strength of link is one of the most consequential, and most underestimated, decisions in building a real bioprocess graph, yet on the page it is a single triple either way (the source prefixes here are illustrative):
# Two systems' names for the same lot, reconciled onto one bp: IRI — at two strengths.
lims:DS-2026-001 owl:sameAs bp:DS-001 . # STRONG — fuses every fact in both directions
erp:material-1000457 skos:exactMatch bp:DS-001 . # SAFE — terms line up, no identity claim
One real batch, named four ways, reconciled onto a single global IRI, with owl:sameAs shown as the strong link that fuses all facts and skos:exactMatch as the safer match that does not, because the strength of the link is itself a modeling decision.
Original diagram by the authors, created with AI assistance.
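To make the difference in link strength concrete, here is a small sketch of what fusing facts across owl:sameAs amounts to, next to a match that is merely recorded. It is a simplification under stated assumptions: a real OWL reasoner does far more than this union-find merge, and the property names and values in `facts` (`lims:status`, `erp:plant`, and so on) are hypothetical placeholders, not data from the running example.

```python
from collections import defaultdict


def fuse_same_as(facts, same_as_links):
    """Merge facts across owl:sameAs links (symmetric, transitive) via union-find."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in same_as_links:
        parent[find(a)] = find(b)

    merged = defaultdict(set)
    for subject, props in facts.items():
        merged[find(subject)] |= props

    # Every member of an identity cluster sees the cluster's full fact set.
    return {s: set(merged[find(s)]) for s in list(parent) + list(facts)}


# Hypothetical facts held by three systems about what may be the same lot.
facts = {
    "lims:DS-2026-001": {("lims:status", "released")},
    "bp:DS-001": {("bp:monomerPct", 98.611)},
    "erp:material-1000457": {("erp:plant", "site-A")},
}
same_as = [("lims:DS-2026-001", "bp:DS-001")]          # STRONG: fused
exact_match = [("erp:material-1000457", "bp:DS-001")]  # SAFE: recorded only

view = fuse_same_as(facts, same_as)
print(view["bp:DS-001"])             # now carries the LIMS fact too
print(view["erp:material-1000457"])  # untouched by the exactMatch link
```

The `exact_match` list is deliberately never passed to the fusion step. If that mapping turns out to be wrong, one row in a mapping table is wrong; if the `same_as` link is wrong, every fact in both records is.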
The typed value: a number is text until you say otherwise
Identity handles the subjects and predicates. The object, when it is a value rather than an edge, needs its own discipline, because a bare 98.611 is not yet a number to a machine; it is a string of characters that looks like one. RDF fixes this with the typed literal: a lexical form (98.611) paired with a datatype IRI (xsd:float) that tells any reader to parse it as a floating-point number, not as text, a date, or a code [2]. Get the datatype wrong and arithmetic silently breaks: strings sort and compare character by character, so "100.0" is "less than" "98.611" because "1" sorts before "9". The datatype is the difference between a value you can compute on and one you can only display.
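The effect of a missing datatype is easy to reproduce. The sketch below compares the same three lexical forms first as text and then as parsed numbers; the values other than 98.611 are made up for the comparison.

```python
as_text = ["98.611", "100.0", "9.5"]
as_numbers = [float(v) for v in as_text]

# Text is ordered character by character, so "1..." sorts before "9...".
print(sorted(as_text))     # ['100.0', '9.5', '98.611']
print(sorted(as_numbers))  # [9.5, 98.611, 100.0]

print("100.0" < "98.611")  # True  (string comparison)
print(100.0 < 98.611)      # False (numeric comparison)
```

Nothing raises an error in the string case, which is the hazard: the comparison runs, returns an answer, and the answer is wrong.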
But a typed number is still ambiguous about what it measures. 98.611 what? The data book's anatomy of a triple flagged this: the historian logged Monomer% = 98.6 with the percent buried in a label, while a CSV wrote %Mono,98.611,frac where "frac" might mean a fraction. Same number, three incompatible meanings. The cure is QUDT — the Quantities, Units, Dimensions and Types vocabulary — which lets the value carry its unit as a machine-readable IRI rather than a string suffix [3]. QUDT separates two things people conflate: the quantity kind (what sort of quantity this is — a dimensionless ratio, a temperature, a mass concentration) and the unit (unit:PERCENT, unit:DEG_C, unit:GM-PER-L). Pinning both means 98.611 unambiguously denotes 98.611 % of a dimensionless ratio, and a temperature of 37.2 is unit:DEG_C with quantity kind Temperature — never to be misread as Fahrenheit or Kelvin. For interchange with clinical and laboratory systems the same role is played by UCUM, the Unified Code for Units of Measure, whose case-sensitive codes (%, Cel, g/L) are designed to be unambiguous across software [4]. A value that carries its unit as data is a future deviation that never happens. Here is the running example's monomer value, fully qualified, exactly as the loadable dataset carries it:
# instances.ttl — 98.611 made self-describing: a number, a unit IRI, and a quantity kind.
bp:DS-001-monomer a qudt:QuantityValue ;
    rdfs:label "DS-001 SEC %monomer" ;
    qudt:numericValue "98.611"^^xsd:float ;          # the number, typed as a float
    qudt:hasUnit unit:PERCENT ;                      # the unit as an IRI, not a string suffix
    qudt:hasQuantityKind qkind:DimensionlessRatio .  # what *sort* of quantity it is
(QUDT's current property is qudt:hasUnit; the older qudt:unit is deprecated.) Now 98.611 can be read, compared, and converted by any system without guessing — the bare rumor has become a fact.
The same number, before and after: a bare 98.611 is text of unknown kind and unit; layered with a datatype, a QUDT unit, and a quantity kind, it becomes a self-describing fact a machine can compute on and never misread.
Original diagram by the authors, created with AI assistance.
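Once a value carries its unit and quantity kind as data, comparison and conversion stop depending on guesswork. The sketch below mirrors the three parts of the qualified value in a small Python class. It rests on assumptions: the `QuantityValue` class and `same_quantity` helper are hypothetical, and the two-entry `TO_BASE` table is hand-written for illustration rather than loaded from QUDT, which publishes its own conversion multipliers per unit.

```python
import math
from dataclasses import dataclass

# Multipliers to the dimensionless base. Hand-written stand-in, NOT read from QUDT.
TO_BASE = {
    "unit:PERCENT": 0.01,
    "unit:UNITLESS": 1.0,
}


@dataclass(frozen=True)
class QuantityValue:
    numeric_value: float
    unit: str           # unit IRI, written as a prefixed name
    quantity_kind: str  # quantity-kind IRI, written as a prefixed name

    def to(self, target_unit: str) -> "QuantityValue":
        """Convert to another unit known to the TO_BASE table."""
        if self.unit not in TO_BASE or target_unit not in TO_BASE:
            raise ValueError(f"no conversion from {self.unit} to {target_unit}")
        base = self.numeric_value * TO_BASE[self.unit]
        return QuantityValue(base / TO_BASE[target_unit], target_unit, self.quantity_kind)


def same_quantity(a: QuantityValue, b: QuantityValue) -> bool:
    """Compare two values of the same quantity kind, whatever their units."""
    if a.quantity_kind != b.quantity_kind:
        raise ValueError("cannot compare values of different quantity kinds")
    left = a.to("unit:UNITLESS").numeric_value
    right = b.to("unit:UNITLESS").numeric_value
    return math.isclose(left, right)


monomer = QuantityValue(98.611, "unit:PERCENT", "qkind:DimensionlessRatio")
as_fraction = QuantityValue(0.98611, "unit:UNITLESS", "qkind:DimensionlessRatio")
print(same_quantity(monomer, as_fraction))
```

A percent value and a fraction value compare as equal here only because each one states its unit. With bare numbers, 98.611 and 0.98611 would simply look like different measurements.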
Why this is exactly what makes data FAIR
The two disciplines of this chapter map directly onto two of FAIR's four letters, which is why they are foundations and not finishing touches [1]. Global, persistent IRIs are what make data Findable (F1) and addressable across systems: you cannot find what has no stable name. Shared vocabularies plus qualified, unit-bearing, datatyped values are what make data Interoperable (I1–I3), the "I" that the data book showed real datasets most often fail, because their metadata used bare strings and local keys. The reconcile-and-normalize pipeline the data book sketched (map every name to one shared predicate, pin every value's unit and datatype, then load) is, step by step, the identity and unit discipline of this chapter applied at scale.
Identity then units, in one pass: reconcile every name onto a global IRI, normalize and type every value with its QUDT unit, and the loaded triple is Findable and Interoperable by construction — the FAIR "I" most real datasets miss.
Original diagram by the authors, created with AI assistance.
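The reconcile-then-normalize pass can be sketched in a few lines. Everything structural here is an assumption made for illustration: the `SUBJECTS` and `PREDICATES` tables, the record layout, and the function name are hypothetical, and the subject mappings stand in for decisions a person has already made, since the code cannot make them.

```python
# Hypothetical lookup tables. The subject mappings encode HUMAN decisions
# about identity; this code only applies them.
SUBJECTS = {
    ("lims", "DS-2026-001"): "bp:DS-001",
    ("coa", "Lot 26-001"): "bp:DS-001",
}
PREDICATES = {
    "Monomer%": ("bp:monomerPct", "unit:PERCENT"),
    "%Mono": ("bp:monomerPct", "unit:PERCENT"),
}


def reconcile_and_normalize(records):
    """Return (loadable statements, records that need human review)."""
    loaded, needs_review = [], []
    for rec in records:
        subject = SUBJECTS.get((rec["source"], rec["name"]))
        mapping = PREDICATES.get(rec["column"])
        try:
            number = float(rec["value"])
        except ValueError:
            number = None
        if subject is None or mapping is None or number is None:
            needs_review.append(rec)  # never guess an identity or a unit
            continue
        predicate, unit = mapping
        loaded.append((subject, predicate, f'"{number}"^^xsd:float', unit))
    return loaded, needs_review


records = [
    {"source": "lims", "name": "DS-2026-001", "column": "%Mono", "value": "98.611"},
    {"source": "coa", "name": "Lot 26-001", "column": "Monomer%", "value": "98.6"},
    {"source": "erp", "name": "1000457", "column": "%Mono", "value": "98.611"},
]
loaded, needs_review = reconcile_and_normalize(records)
print(loaded)
print(needs_review)  # the ERP record: no reconciled subject yet
```

The ERP record is routed to review rather than loaded because no one has yet decided which global IRI it belongs to. Failing closed on an unknown name is what keeps the loaded graph free of guessed identities.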
The unsolved part: identity reconciliation is still partly manual
Units are, honestly, the solved half of this chapter. Given the will, QUDT and UCUM remove unit ambiguity completely — the failures there are discipline failures, not gaps in the tools. Identity reconciliation is the unsolved half. Deciding that the DCS's BR101.Temp.PV, the LIMS's DS-2026-001, the ERP's 1000457, and a PDF's "Lot 26-001" all refer to the same batch is an entity-resolution problem, and across decades-old legacy systems with no shared key it remains stubbornly manual or semi-automated, error-prone, and expensive. Get one match wrong and owl:sameAs does not fail quietly — it fuses two distinct batches into one and propagates every false fact in both directions, the contamination the sameAs literature warned about [5]. There is no reasoner that can tell you two correctly-typed IRIs denote the same physical thing; that judgment comes from process knowledge a human supplies.
A second open problem is quieter and longer-term: persistence. FAIR demands identifiers that stay resolvable, but a bp: IRI is only as durable as the organization and namespace behind it. Who guarantees that https://example.org/bioproc#BATCH-2026-001 still resolves in fifteen years, the horizon a GxP record may have to survive? PID governance — handles, PURLs, registries, and the institutions that keep them alive — is a real and underfunded discipline, and an identifier that dies is no better than the local key it replaced. So the standard this chapter sets is sober: units you can make unambiguous today; identity and its persistence you must actively govern, and the field has not handed you an automatic answer.
Why it matters
Every edge the rest of this book draws and every value it hangs depends on the two guarantees made here. If identity is shaky, a lineage walk can silently cross between two different batches and an investigation reaches a wrong conclusion with full confidence. If units are bare, a number means different things in different systems and a control decision is made on a misread value. The whole edifice of classes, relations, and axioms rests on names that are global and values that are self-describing; this is the bedrock under all of it, which is why it closes the foundations before we start modeling the process itself.
In the real world
These are not bespoke ideas. QUDT is a published, widely-used units ontology, and UCUM is the unit standard embedded in clinical and laboratory data exchange worldwide [3][4]. Persistent-identifier systems — DOIs for publications, handles and PURLs for data, GS1 keys for traded goods — already run at global scale, and the serialization chapter shows GS1 identifiers doing exactly this job on a finished vial. On the reconciliation side, the commercial "data fabric" and "master data management" products that plants buy are, underneath the branding, entity-resolution engines wrestling with precisely the identity problem this chapter says is unsolved — which is why they remain expensive, human-supervised, and never quite finished.
Key terms
- IRI (Internationalized Resource Identifier) — the globally unique web name RDF gives every subject, predicate, and resource object; unlike a local primary key it means the same thing across systems and sites.
- Namespace / minting — the prefix that makes a local id global, and the discipline of creating stable, non-reused identifiers.
- owl:sameAs — the strong assertion that two IRIs denote the identical individual, fusing all their facts; powerful and easily misused (the "sameAs problem").
- skos:exactMatch / closeMatch — softer links that record two terms line up without claiming entity identity, so a bad match degrades a mapping rather than corrupting the graph.
- Typed literal / datatype — a value paired with a datatype IRI (xsd:float) so a machine parses it as a number, date, or code rather than as text.
- QUDT — the Quantities, Units, Dimensions and Types vocabulary; lets a value carry its unit and quantity kind as machine-readable IRIs rather than a string suffix.
- UCUM — the Unified Code for Units of Measure; case-sensitive unit codes designed to be unambiguous across software, used widely in lab and clinical exchange.
- Quantity kind vs unit — what sort of quantity a value is (temperature, mass concentration, dimensionless ratio) versus the specific unit it is expressed in.
- Entity resolution / reconciliation — deciding that differently-named records refer to the same real thing; across legacy systems, still largely manual and the unsolved half of identity.
- FAIR F1 / I — the principles that data carry a globally unique persistent identifier (Findable) and use shared vocabularies with qualified, unit-bearing values (Interoperable).
Where this leads
Part I is complete: we have the upper spine, the craft of authoring classes and axioms, and the identity-and-unit discipline that keeps names global and values self-describing. We are ready to point the whole kit at the process. Part II begins where the medicine itself begins — not in a tank, but in an idea. The next chapter, Modeling the Target and the Product Concept, names the very first entities of a program: the disease target, the mechanism of action, and the product concept, and shows how reaching for established biomedical ontologies (for genes, proteins, and diseases) means the model is interoperable from its first day rather than its last.