Naming Things: Tags, Hierarchies, and the Unified Namespace

📍 Where we are: Chapter 4 gave us the ISA-88/95 batch and equipment model in PostgreSQL. Before we capture a single live reading in Part II, Chapter 5 gives every signal a name: one that the historian, the broker, the knowledge graph, and a regulator can all agree on.

Phil Karlton, a famous engineer at Netscape, supposedly said there are only two hard things in computer science: cache invalidation and naming things. In a biomanufacturing plant the second one quietly decides whether the first book's promise — trace any number back to its source — is achievable at all. A bioreactor temperature probe does not emit knowledge. It emits a tag: a short string and a float, like TT-101 = 37.02. If that same probe is called TIC_101.PV in the Emerson DeltaV control system (the plant's DCS, or distributed control system), BR1_TEMP in the historian, and Reactor1 Temperature in the spreadsheet a scientist keeps, you do not have one signal — you have three, and no machine can tell they are the same thing.

Multiply that by ten thousand tags across a SCADA (the plant-floor supervisory control system), an MES (the manufacturing execution system that runs the batch), and an ERP (the enterprise resource planning system that runs the business), and the result is the single most common, least glamorous cause of data-management failure in the industry [1].

This chapter is the cure, and it is mostly discipline rather than code. We design one naming convention, ground it in ISA-95, project it into a Unified Namespace (UNS) topic tree and a Sparkplug B topic, store it as governed data, and — because this is the hands-on book — ship a linter that refuses to let a bad tag into the platform. The code is small. The payoff is everything downstream.

The simple version

A tag name is a postal address for data. BR101.Temp.PV is "the present value of the temperature measurement on bioreactor 101" — and newark/upstream/BR101/Temp is the same address written as a folder path the whole plant can browse. Pick the address format once, write it down, and have a robot at the door turn away anything that doesn't match. After that, every later chapter just files its letters in the right box.

What this chapter covers

We start with why one signal must have exactly one canonical name, then build the convention on top of the ISA-95 hierarchy from Chapter 4. We turn each tag into a UNS path and a Sparkplug B topic, look at how ISA-5.1 instrument tags and ISA-95 Part 7 aliasing let the floor and the cloud disagree on labels without losing the thread, and store the whole dictionary as governed data in PostgreSQL. Finally we run the real linter from the companion repo against the simulator's 16 tags, watch it pass, and watch it reject the kinds of drift that happen in real plants.

One signal, one canonical name

Open the running case's golden data and look at the raw tags. This is a slice of datasets/fedbatch_timeseries_10min.sample.csv from the companion repo, the long-format time series produced by our 14-day fed-batch run of CHO cells. Long format means one row per timestamp-and-tag pair, with the signal name in a tag column and the reading in a value column, not a column per tag. Fed-batch means nutrients are fed in over the run, rather than drained and replenished continuously, and CHO is the Chinese-hamster-ovary cell line that secretes the antibody:

datasets/fedbatch_timeseries_10min.sample.csv (excerpt)
ts,tag,value,unit,quality,batch_id
2026-01-05 00:00:00+00:00,BR101.Agitation.PV,81.4323,rpm,192,BATCH-2026-001
2026-01-05 00:00:00+00:00,BR101.Agitation.SP,81.6008,rpm,192,BATCH-2026-001
2026-01-05 00:00:00+00:00,BR101.DO.PV,40.8224,%sat,192,BATCH-2026-001
2026-01-05 00:00:00+00:00,BR101.DO.SP,40.0,%sat,192,BATCH-2026-001

Every row already carries a tag and a unit (the quality column carries the OPC-style quality code Chapter 7 establishes — 192 = Good, 64 = Uncertain, 0 = Bad — so a downstream consumer never mistakes a bad reading for a trustworthy one). The discipline question is: who guarantees that BR101.DO.PV always means the same probe, in the same units, forever — and that nobody invents BR101.DissolvedO2 next quarter? The answer is a convention plus a register plus a gate. The convention says what a legal tag looks like; the register is the single list of every legal tag; the gate is automation that fails loudly when reality drifts from the register.

The convention the repo uses is deliberately boring, three dotted segments:

<UNIT>.<Measurement>.<Attr>     e.g.  BR101.Temp.PV

UNIT is a piece of equipment from the ISA-95 model we seeded in Chapter 4: BR101 (the production bioreactor), N1SEED (a seed-train vessel), or PA01 (the Protein A capture skid, a self-contained, pre-plumbed equipment unit mounted on a frame that performs the first chromatography purification step). PA01's own signals, such as PA01.UV280.PV in mAU and PA01.Phase.State, are worked in the downstream chromatography and contextualization chapters.
Measurement is the physical thing being measured — Temp, pH, DO, Agitation, Titer (the antibody concentration in g/L — a product-quality signal that, unlike Temp or DO, is typically measured offline by Protein A HPLC rather than read from a continuous probe).
Attr is the role of this particular number: PV (present/process value), SP (setpoint), MV (manipulated/output value), Rate, or State.

That last segment matters more than it looks. The difference between a setpoint (what we asked for) and a process value (what we got) is the difference between an instruction and a measurement — and conflating them is a classic data-integrity error. Keeping BR101.Temp.SP and BR101.Temp.PV as distinct, named signals is the naming-layer expression of an ALCOA+ principle (ALCOA+ is the regulators' shorthand for data that is Attributable, Legible, Contemporaneous, Original, and Accurate — plus Complete, Consistent, Enduring, and Available): a value must be attributable to its true source, whether that source is a person or, here, an automatically generated reading from a specific instrument [2][3].
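To make the three-segment grammar concrete, here is a minimal sketch that splits a tag and checks its role against the closed set. The ParsedTag type and parse_tag helper are illustrative names introduced here, not part of the companion repo.

```python
from typing import NamedTuple

# Closed set of attribute roles, as listed in the convention above.
ROLES = {"PV", "SP", "MV", "Rate", "State"}


class ParsedTag(NamedTuple):
    """Hypothetical container for the three segments of a canonical tag."""
    unit: str
    measurement: str
    attr: str


def parse_tag(tag: str) -> ParsedTag:
    """Split '<UNIT>.<Measurement>.<Attr>' and reject unknown roles."""
    parts = tag.split(".")
    if len(parts) != 3:
        raise ValueError(f"{tag!r} must have exactly three dotted segments")
    parsed = ParsedTag(*parts)
    if parsed.attr not in ROLES:
        raise ValueError(f"{parsed.attr!r} is not one of {sorted(ROLES)}")
    return parsed


sp = parse_tag("BR101.Temp.SP")
pv = parse_tag("BR101.Temp.PV")
# Same unit and measurement, different role: an instruction vs. a measurement.
print(sp.attr, pv.attr)
```

Because the role is its own field, a consumer can ask for "the setpoint" or "the process value" explicitly instead of guessing from a free-form suffix.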

Grounding the names in ISA-95

A convention with no map underneath it is just a string format. The reason BR101 is meaningful is that it resolves to a real position in the ISA-95 equipment hierarchy — enterprise → site → area → work-center/unit — the technology-agnostic address space the standard defines for exactly this purpose [4]. Our seeded model places the production bioreactor at Newark site → upstream area → BR101, and the Protein A skid at Newark → downstream → PA01. The tag dictionary's job is to carry that placement around with every signal, so a number is never an orphan.

In the companion code, that mapping lives at the top of examples/chapters/04-naming-uns/naming.py as a small, explicit table covering the units that carry sensor tags — a subset consistent with the seeded ISA-95 model:

examples/chapters/04-naming-uns/naming.py
import re

TAG_RE = re.compile(r"^[A-Z][A-Z0-9]+\.[A-Za-z][A-Za-z0-9]+\.(PV|SP|MV|Rate|State)$")

# the equipment hierarchy the tags live under (matches the seeded ISA-95 model)
UNIT_AREA = {"BR101": ("newark", "upstream"), "N1SEED": ("newark", "upstream"),
             "PA01": ("newark", "downstream"), "PBR201": ("newark", "upstream")}
QUDT = {"degC": "DEG_C", "pH": "PH", "%sat": "PERCENT", "rpm": "REV-PER-MIN",
        "kg": "KiloGM", "bar": "BAR", "L": "L", "%": "PERCENT", "g/L": "GM-PER-L"}

UNIT_AREA lists only the units that actually carry sensor tags, so it overlaps with the full seeded model without being identical to it. The fed-batch seed also defines TFF01 and FILL-LINE-01 (downstream units with no live tags here), while PBR201 in the table is the intensified, perfusion-variant bioreactor: scaffolding for the continuous case, not part of the fed-batch seed. (Perfusion continuously feeds fresh medium and removes spent medium so the culture runs for weeks at high cell density; it is the "continuous" alternative to fed-batch.)

Two design decisions are worth pausing on. First, TAG_RE is the machine-readable contract: a tag is legal only if it is an uppercase-led unit code, a measurement, and one of a closed set of attribute roles. That closed set is what stops the slow rot of synonyms.

Second, the QUDT table maps each plant unit-of-measure string to the local name of a QUDT unit IRI (for example degC → DEG_C, which expands to http://qudt.org/vocab/unit/DEG_C). QUDT (Quantities, Units, Dimensions and Types) is a public vocabulary of units, and an IRI (Internationalized Resource Identifier) is a URL-like global identifier. Units are part of a value's identity, since "37" is meaningless until you know it is degrees Celsius, and pinning them to a global vocabulary now is what lets the Chapter 19 knowledge graph (the RDF graph that links batch, equipment, material, and result) reason across systems later. The %sat versus plain % distinction in the data above is exactly the kind of unit ambiguity this table forces you to resolve once, on paper, instead of arguing about it during a deviation investigation.
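As a small illustration of the unit mapping, the sketch below expands a plant unit string into a full QUDT IRI and fails loudly for a unit nobody has mapped yet. The qudt_iri helper and the QUDT_BASE constant are assumptions made for this example; only the QUDT table itself is taken from naming.py.

```python
QUDT_BASE = "http://qudt.org/vocab/unit/"  # base of the degC example above

# Same mapping as the QUDT table in naming.py.
QUDT = {"degC": "DEG_C", "pH": "PH", "%sat": "PERCENT", "rpm": "REV-PER-MIN",
        "kg": "KiloGM", "bar": "BAR", "L": "L", "%": "PERCENT", "g/L": "GM-PER-L"}


def qudt_iri(plant_unit: str) -> str:
    """Hypothetical helper: plant unit string -> full QUDT unit IRI."""
    try:
        return QUDT_BASE + QUDT[plant_unit]
    except KeyError:
        # An unmapped unit is a decision someone still has to make on paper.
        raise ValueError(f"no QUDT mapping for unit {plant_unit!r}") from None


print(qudt_iri("degC"))
print(qudt_iri("%sat"), qudt_iri("%"))
```

Raising instead of returning a default is a deliberate choice in this sketch: an unknown unit string should stop the build rather than travel downstream unlabelled.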

From tag to topic: the Unified Namespace

Anatomy of the tag string: asset, measurement, role

Before we project the tag anywhere, it pays to dissect the tag string itself, because every later form is built from its three segments. The canonical tag BR101.Temp.PV is not an opaque label; it is a tiny grammar — <asset>.<measurement>.<role> — and each dot-separated segment is load-bearing.

The canonical tag string BR101.Temp.PV exploded into three colored segment pills — asset, measurement, role — above two parallel forms: the slash-delimited UNS path newark/upstream/BR101/Temp and the five-level Sparkplug B topic, both re-projected from the same segments.

The asset segment (BR101) is the ISA-95 unit; it is the only segment that knows nothing about the measurement and everything about the equipment, which is why it alone determines the site and area levels of the UNS path. The measurement segment (Temp) is the physical quantity, carried through verbatim into both transport forms. The role segment (PV) is the one most often gotten wrong: it distinguishes a process value from a setpoint (SP), an output (MV), a Rate, or a State, and it is the closed set the TAG_RE pattern pins down. Read the grammar left-to-right and the tag tells you what equipment, what quantity, and what kind of number — with no lookup required.

The figure also shows the punchline of this chapter in one glance: the UNS path and the Sparkplug topic are re-projections of the same segments, not independent strings someone types by hand. asset and measurement appear unchanged in the UNS path; asset reappears as the Sparkplug device id. Because both forms are derived, the address space cannot drift from the register.

Where this sits in the trilogy

The canonical tag we are dissecting here is the code-level answer to a thread that runs through all three books. The first book frames the same naming-and-identity problem physically, where a global namespace is stamped onto the product itself in Packaging and Serialization. The second book treats it as a data-point — the mapping record that makes one system's label resolve to another's — in Semantic Interoperability. This chapter is where that record becomes an enforced row and a linter: the gov.tag_dictionary entry and naming.py are the implementation those two chapters were describing.

From tag to UNS path and Sparkplug topic

A tag dictionary makes signals consistent. A Unified Namespace makes them browsable. The UNS idea, popularized in the manufacturing-IoT community, is a single real-time hierarchy — semantically organized like the business itself, and broker-agnostic — that becomes the one place any system goes to find the current state of anything [5]. The strong recommendation, which we follow, is to shape that hierarchy on the ISA-95 levels — Enterprise / Site / Area / Line / Cell — rather than inventing a parallel taxonomy [6].

Concretely, the UNS is a tree of MQTT topics. MQTT is the lightweight publish/subscribe transport standardized by OASIS and as ISO/IEC 20922 [7]. Publish/subscribe means a sender posts a message to a named topic and any number of interested receivers subscribe to it, with neither side needing to know about the other. A topic is just a slash-delimited path, like a folder. The community best practice is hierarchical, general-to-specific levels, all lowercase, no spaces, so the path is predictable and the single-level (+) and multi-level (#) wildcards do something useful [8]. Our paths follow that practice for the site and area levels but keep the unit and measurement segments in their canonical case (BR101, Temp), so each path maps back to its tag verbatim; MQTT topic names are case-sensitive, so that casing has to be used consistently. With that path in hand, a dashboard can subscribe to newark/upstream/+/Temp, where the + stands in for any single level (here, any unit), and instantly receive every reactor temperature in the upstream area without naming each one.
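The wildcard behaviour is easy to check without a broker. The sketch below is a simplified, pure-Python topic matcher, written for illustration; it ignores the special handling MQTT gives to topics that start with $. The topic list is made up for the example, and the N1SEED temperature path in particular is hypothetical.

```python
def topic_matches(topic_filter: str, topic: str) -> bool:
    """Return True if an MQTT topic matches a subscription filter.

    '+' matches exactly one level; '#' matches all remaining levels
    (including none) and is only valid as the last level of the filter.
    """
    f_levels = topic_filter.split("/")
    t_levels = topic.split("/")
    for i, f in enumerate(f_levels):
        if f == "#":
            return i == len(f_levels) - 1
        if i >= len(t_levels):
            return False
        if f != "+" and f != t_levels[i]:
            return False
    return len(f_levels) == len(t_levels)


# Illustrative topics only; the N1SEED temperature path is hypothetical.
topics = [
    "newark/upstream/BR101/Temp",
    "newark/upstream/BR101/DO",
    "newark/upstream/N1SEED/Temp",
    "newark/downstream/PA01/UV280",
]
hits = [t for t in topics if topic_matches("newark/upstream/+/Temp", t)]
print(hits)
```

The filter selects both upstream temperature paths and nothing else, which is the behaviour a predictable, general-to-specific hierarchy buys.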

naming.py derives the UNS path directly from the canonical tag, so the address space can never disagree with the dictionary:

examples/chapters/04-naming-uns/naming.py
def uns_path(tag: str) -> str:
    unit, measurement, _attr = tag.split(".")
    site, area = UNIT_AREA[unit]
    return f"{site}/{area}/{unit}/{measurement}"


def sparkplug_topic(tag: str, group: str = "newark", edge: str = "edge1") -> str:
    unit = tag.split(".")[0]
    return f"spBv1.0/{group}/DDATA/{edge}/{unit}"

The figure below shows how one signal projects into all three views at once.

A bioreactor temperature signal shown projected into three aligned views: the canonical dotted tag, the slash-delimited UNS topic tree grounded in the ISA-95 enterprise-site-area-unit hierarchy, and the fixed-arity Sparkplug B topic.

One signal, three coordinated addresses: the canonical tag is the system of record, the UNS path is the human- and dashboard-browsable tree, and the Sparkplug B topic is the wire format. All three are generated from the same dictionary, so they cannot drift apart.

Original diagram by the authors, created with AI assistance.

Sparkplug B: a stricter cousin

Plain MQTT topics are free-form, which is their charm and their danger: nothing stops two teams from organizing the tree differently. Sparkplug B (the Eclipse Sparkplug 3.0 specification) is an opinionated profile on top of MQTT that fixes the topic structure, which for device-level messages is exactly five levels, spBv1.0/<group_id>/<message_type>/<edge_node_id>/<device_id> (node-level messages omit the trailing device id), and adds birth/death session-state management so subscribers always know whether a publisher is alive [9]. That is why sparkplug_topic() returns spBv1.0/newark/DDATA/edge1/BR101: the newark group, a DDATA (device-data) message, from edge node edge1 (the gateway at the plant edge that publishes on behalf of the device), for device BR101.

The two topic shapes are not redundant — they serve different audiences. The UNS path is the human-and-dashboard-browsable single source of truth; the Sparkplug topic is the disciplined, fixed-arity (always exactly the five levels above) wire format an edge gateway actually publishes on, which we wire up for real in the next chapter. Generating both from one dictionary is how a plant keeps a flexible browsing tree and a rigid transport contract without the two ever contradicting each other [6].
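The fixed arity is what makes a Sparkplug topic trivially machine-checkable. As a sketch under that assumption, the parser below accepts only the five-level device form described above; the SparkplugTopic type and parse_sparkplug_topic helper are hypothetical names for this example, not part of any Sparkplug library.

```python
from typing import NamedTuple


class SparkplugTopic(NamedTuple):
    """Hypothetical container for the five levels of a device-level topic."""
    namespace: str
    group_id: str
    message_type: str
    edge_node_id: str
    device_id: str


def parse_sparkplug_topic(topic: str) -> SparkplugTopic:
    """Split a device-level Sparkplug B topic, rejecting any other shape."""
    levels = topic.split("/")
    if len(levels) != 5:
        raise ValueError(f"expected 5 levels, got {len(levels)}: {topic!r}")
    if levels[0] != "spBv1.0":
        raise ValueError(f"unexpected namespace {levels[0]!r}")
    return SparkplugTopic(*levels)


t = parse_sparkplug_topic("spBv1.0/newark/DDATA/edge1/BR101")
print(t.group_id, t.message_type, t.edge_node_id, t.device_id)
```

A free-form UNS path fails this check immediately, which is the point: the browsing tree can stay flexible while the wire format stays rigid.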

Two names for one thing: ISA-5.1 and Part 7 aliasing

So far we have assumed a clean slate. Real plants are not clean slates. The instrument loop on the P&ID (the piping-and-instrumentation diagram) has been called TT-101 since before the historian existed — a tag in the form defined by ISA-5.1, the instrumentation symbols and identification standard, where a first letter denotes the measured variable and the rest identifies the loop (TT = temperature transmitter, PIC = pressure indicating controller) [10]. The DCS exposes it as TIC_101.PV. The ERP knows it only as a cost-center asset number. Demanding that everyone rename to BR101.Temp.PV is a multi-year change-control project nobody will finance.

The standards-blessed answer is not to force one name but to declare them equivalent. ISA-95 Part 7, the Alias Service Model, exists precisely to reconcile the different identifiers different systems use for the same object, so the platform stays self-consistent even when one physical signal carries several names [11]. In our world this means the canonical BR101.Temp.PV is the system of record, and TT-101, TIC_101.PV, and the ERP asset number are aliases that all resolve to it. The tag dictionary is the natural home for that resolution, but its tag primary key carries no retire/effective columns, so the repo realizes the mapping as a companion, effective-dated gov.tag_alias table (old_tag → new_tag, effective, reason; Chapter 27). Each row records the date (effective) from which the new name takes over, so a query can resolve a name as of any point in time, and a gov.v_tag_current view resolves either name. Storing the mapping as data, rather than burying it in glue code, is what makes it auditable.
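Effective-dated resolution can be sketched in a few lines. The example below mirrors the column names of gov.tag_alias (old_tag, new_tag, effective, reason), but the rows, dates, and the resolve function are invented for illustration; the logic of the repo's gov.v_tag_current view is not shown in this chapter and may differ.

```python
from datetime import date
from typing import List, NamedTuple


class TagAlias(NamedTuple):
    """One row shaped like gov.tag_alias."""
    old_tag: str
    new_tag: str
    effective: date
    reason: str


# Illustrative rows only; the dates and reasons are made up for this sketch.
ALIASES = [
    TagAlias("TT-101", "BR101.Temp.PV", date(2026, 1, 1), "P&ID loop tag"),
    TagAlias("TIC_101.PV", "BR101.Temp.PV", date(2026, 1, 1), "DCS point name"),
]


def resolve(name: str, as_of: date, aliases: List[TagAlias] = ALIASES) -> str:
    """Follow alias rows effective on or before `as_of` to the canonical tag."""
    seen = set()
    while name not in seen:  # the seen set guards against accidental cycles
        seen.add(name)
        hits = [a for a in aliases if a.old_tag == name and a.effective <= as_of]
        if not hits:
            break
        name = max(hits, key=lambda a: a.effective).new_tag
    return name


print(resolve("TT-101", date(2026, 2, 1)))
print(resolve("TT-101", date(2025, 12, 31)))
```

Before the effective date the old name resolves to itself, which is what lets a historical query answer in the vocabulary that was valid at the time.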

Storing the dictionary as governed data

A convention that lives only in a Python file is a suggestion. A convention that lives in a governed database table, with a primary key the database itself enforces and a linter the test suite runs, is policy. The companion repo's examples/platform/db/40-gov.sql creates the schema where the dictionary lands once the full stack is up — note the comments tying each column back to its downstream consumer:

examples/platform/db/40-gov.sql
-- The tag dictionary (Ch 5): every signal's canonical name, asset, unit, the
-- QUDT unit IRI, its UNS path and Sparkplug topic, and a deadband. The naming
-- linter (chapters/04-naming-uns/naming.py, exercised by tests/test_chapters.py
-- under `make test`) rejects any tag not matching the convention.
CREATE TABLE gov.tag_dictionary (
    tag           text PRIMARY KEY,         -- BR101.Temp.PV
    asset         text NOT NULL,            -- BR101
    measurement   text NOT NULL,            -- Temp
    unit          text NOT NULL,            -- degC
    qudt_unit     text,                     -- http://qudt.org/vocab/unit/DEG_C
    uns_path      text NOT NULL,            -- newark/upstream/BR101/Temp
    sparkplug_topic text NOT NULL,          -- spBv1.0/newark/DDATA/edge1/BR101
    data_type     text NOT NULL DEFAULT 'Double',
    deadband      numeric NOT NULL DEFAULT 0
);

The tag column is the primary key: the database itself now enforces "one canonical name, used once." uns_path and sparkplug_topic are stored, not recomputed at read time, so the Chapter 7 publisher and the Chapter 17 contextualization views read the same topic the linter approved. The deadband column is a quiet but important detail — it is the change threshold below which a reading is not worth publishing, and storing it next to the tag keeps capture policy governed rather than scattered across collector configs. The same gov schema also holds the jurisdiction policy (Chapter 26) and the supplier register (Chapter 25), so all the data about the data sits in one place.
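To show what the deadband column controls, here is a minimal sketch of deadband filtering. It assumes the comparison is made against the last published value, which is one common interpretation; the apply_deadband name and the sample readings are illustrative, and the repo's collector may implement the policy differently.

```python
from typing import Iterable, Iterator


def apply_deadband(readings: Iterable[float], deadband: float) -> Iterator[float]:
    """Yield only readings that moved at least `deadband` from the last one published."""
    last_published = None
    for value in readings:
        if last_published is None or abs(value - last_published) >= deadband:
            last_published = value
            yield value


# Illustrative temperature readings in degC.
readings = [37.00, 37.01, 37.02, 37.10, 37.11]
print(list(apply_deadband(readings, 0.05)))  # small wiggles are suppressed
print(list(apply_deadband(readings, 0)))     # default of 0 publishes everything
```

With the table's default of 0, every reading passes, so suppression is something a tag owner opts into explicitly, per tag, in governed data.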

build_dictionary() in naming.py produces exactly the rows this table expects — asset, measurement, unit, QUDT IRI, UNS path, and Sparkplug topic — from the live tag set, which is how the running stack, the seed data, and the book can never quietly disagree.
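The repo's build_dictionary() is not reproduced in this chapter, so the following is only a stand-in sketch of the same idea: derive every column of a dictionary row from the tag and its unit. The build_rows name, the trimmed lookup tables, and the list-of-dicts return type are assumptions; the group and edge values follow the defaults of sparkplug_topic() shown earlier.

```python
from typing import Dict, List, Optional

# Trimmed copies of the naming.py tables, enough for this example.
UNIT_AREA = {"BR101": ("newark", "upstream"), "PA01": ("newark", "downstream")}
QUDT = {"degC": "DEG_C", "%sat": "PERCENT", "rpm": "REV-PER-MIN"}


def build_rows(tags_units: Dict[str, str], group: str = "newark",
               edge: str = "edge1") -> List[Dict[str, Optional[str]]]:
    """Hypothetical stand-in: one gov.tag_dictionary-shaped row per tag."""
    rows = []
    for tag, unit in sorted(tags_units.items()):
        asset, measurement, _attr = tag.split(".")
        site, area = UNIT_AREA[asset]
        code = QUDT.get(unit)  # qudt_unit is nullable in the table
        rows.append({
            "tag": tag,
            "asset": asset,
            "measurement": measurement,
            "unit": unit,
            "qudt_unit": f"http://qudt.org/vocab/unit/{code}" if code else None,
            "uns_path": f"{site}/{area}/{asset}/{measurement}",
            "sparkplug_topic": f"spBv1.0/{group}/DDATA/{edge}/{asset}",
        })
    return rows


rows = build_rows({"BR101.Temp.PV": "degC", "BR101.DO.SP": "%sat"})
for row in rows:
    print(row["tag"], row["uns_path"], row["sparkplug_topic"])
```

Nothing in a row is typed by hand except the tag and its unit, which is why the derived columns cannot drift from the register.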

The same row, as a triple and a shape

The governed row is not only a SQL fact; it is one short step from a semantic one. The Chapter 19 knowledge graph loads exactly this kind of record as RDF triples (each fact a subject-predicate-object statement), and the dictionary entry above maps onto a handful of them — the tag becomes an IRI, the qudt_unit column becomes a real edge into the QUDT vocabulary, and the ISA-95 placement becomes a relation a query can walk:

# the BR101.Temp.PV dictionary row, the same fact as RDF triples
bp:BR101.Temp.PV  a            bp:Tag ;
                  bp:onAsset   bp:BR101 ;
                  bp:measures  bp:Temperature ;
                  bp:role      "PV" ;
                  qudt:unit    unit:DEG_C ;
                  bp:unsPath   "newark/upstream/BR101/Temp" .
bp:BR101  a bp:Unit ; bp:inArea bp:Newark-Upstream .

Two things the relational primary key only implies, the graph can make checkable. First, the closed attribute set the regex enforces is, in ontology terms, the classes-and-taxonomy decision to model PV, SP, MV, Rate, and State as a fixed enumeration — which a SHACL shape (the Shapes Constraint Language that validates a graph the way the linter validates a string) states declaratively, so the same contract guards both the wire and the graph:

# shapes.ttl — the graph-side mirror of TAG_RE's closed attribute set
bp:TagShape a sh:NodeShape ;
    sh:targetClass bp:Tag ;
    sh:property [ sh:path bp:onAsset ; sh:minCount 1 ; sh:class bp:Unit ] ;
    sh:property [ sh:path bp:role ;
                  sh:in ( "PV" "SP" "MV" "Rate" "State" ) ] .

The membership check the linter runs in Python — is BR101 a real unit in the hierarchy? — is the closed-world question the release-gate-and-SHACL chapter argues SHACL exists for: a missing or off-list value is a failure now, not an open question. Second, a single SPARQL competency question — give me every temperature signal in the upstream area — answers the same intent as the newark/upstream/+/Temp MQTT wildcard, but over the graph rather than the live broker:

SELECT ?tag WHERE { ?tag bp:onAsset ?u ; bp:measures bp:Temperature .
                    ?u bp:inArea bp:Newark-Upstream . }

That the wildcard and the query select the same signals is the whole payoff of grounding both forms in one dictionary: the relations-and-genealogy spine that later walks a lot back to its cell bank, and the topic tree a dashboard browses, are two projections of the identity this chapter pins down. The alias mapping of the previous section is the same idea a third way: an effective-dated gov.tag_alias row is in effect a PROV-O style provenance edge (new_tag wasRevisionOf old_tag, as of a date), so "which name was canonical on the day this batch ran?" is answerable as data, not archaeology.

Anatomy of a tag dictionary entry: one governed row

Look at a single populated row of gov.tag_dictionary: the entry for BR101.Temp.PV, generated from the real golden dataset, where that tag carries a degC reading of 37.0145 at quality 192 (Good) in BATCH-2026-001. Every column does a distinct, governed job, and together they are the reason a temperature reading is never an orphan number.

An identity card dissecting one row of gov.tag_dictionary for BR101.Temp.PV, field by field: the tag primary key (highlighted), asset, measurement, unit, the QUDT unit IRI, the UNS path, the Sparkplug topic, data type and deadband, plus a panel on where the row lives.

One row of gov.tag_dictionary: the tag is the database-enforced primary key, qudt_unit pins the unit to a global vocabulary, and uns_path and sparkplug_topic are stored — not recomputed — so every downstream reader gets the address the linter approved.

Original diagram by the authors, created with AI assistance.

The fields fall into three groups. The identity group — tag, asset, measurement — answers which signal is this and where does it live: tag is the canonical name and primary key, asset resolves to the ISA-95 unit BR101, and measurement is the physical quantity Temp. The meaning group — unit, qudt_unit, data_type — answers what does the number mean: the plant string degC, its machine-readable IRI http://qudt.org/vocab/unit/DEG_C, and the value type Double. The addressing-and-policy group — uns_path, sparkplug_topic, deadband — answers where does it travel and when is it worth publishing: the browsable path newark/upstream/BR101/Temp, the wire topic spBv1.0/newark/DDATA/edge1/BR101, and the change threshold (0 by default) below which a reading is suppressed. Critically, uns_path and sparkplug_topic are stored columns derived once by build_dictionary(), not formulas evaluated at read time — so a typo in someone's later query can never silently re-route a signal to a different topic than the one governance blessed.

The linter: a robot at the door

Here is where the hands-on book earns its title. A convention is only real if something enforces it automatically, on every change, forever. That something is the linter — and it is genuinely tiny. From examples/chapters/04-naming-uns/naming.py:

examples/chapters/04-naming-uns/naming.py
def lint_tag(tag: str) -> str | None:
    """Return None if the tag is valid, else a reason string."""
    if not TAG_RE.match(tag):
        return f"'{tag}' does not match <UNIT>.<Measurement>.<Attr>"
    unit = tag.split(".")[0]
    if unit not in UNIT_AREA:
        return f"unit '{unit}' is not in the equipment hierarchy"
    return None


def lint_dataset() -> tuple[pd.DataFrame, list[str]]:
    df = pd.read_parquet(DATA / "fedbatch_timeseries.parquet")
    tags_units = dict(df.groupby("tag")["unit"].first())
    problems = [f"{t}: {lint_tag(t)}" for t in tags_units if lint_tag(t)]
    return build_dictionary(tags_units), problems

Two checks, two failure modes. First, does the tag match the structural contract TAG_RE? Second, even if well-formed, does its unit code correspond to a real piece of equipment in the hierarchy? A perfectly-shaped tag for a vessel that does not exist is just as wrong as a malformed one — it is an orphan with good handwriting.

The linter reads the full dataset the repo ships as fedbatch_timeseries.parquet. It is the same data as the …10min.sample.csv excerpt above, with an identical ts,tag,value,unit,quality,batch_id schema, but it is the complete set in a compact binary format that make data regenerates deterministically, rather than a human-readable slice. Run it against the real golden dataset (python chapters/04-naming-uns/naming.py) and you get the actual, tested output, namely the first eight rows of the generated dictionary and a clean bill of health:

               tag asset measurement unit                              qudt_unit                        uns_path                  sparkplug_topic
BR101.Agitation.PV BR101   Agitation  rpm http://qudt.org/vocab/unit/REV-PER-MIN newark/upstream/BR101/Agitation spBv1.0/newark/DDATA/edge1/BR101
BR101.Agitation.SP BR101   Agitation  rpm http://qudt.org/vocab/unit/REV-PER-MIN newark/upstream/BR101/Agitation spBv1.0/newark/DDATA/edge1/BR101
       BR101.DO.PV BR101          DO %sat     http://qudt.org/vocab/unit/PERCENT        newark/upstream/BR101/DO spBv1.0/newark/DDATA/edge1/BR101
       BR101.DO.SP BR101          DO %sat     http://qudt.org/vocab/unit/PERCENT        newark/upstream/BR101/DO spBv1.0/newark/DDATA/edge1/BR101
    BR101.FeedA.PV BR101       FeedA   kg      http://qudt.org/vocab/unit/KiloGM     newark/upstream/BR101/FeedA spBv1.0/newark/DDATA/edge1/BR101
    BR101.FeedB.PV BR101       FeedB   kg      http://qudt.org/vocab/unit/KiloGM     newark/upstream/BR101/FeedB spBv1.0/newark/DDATA/edge1/BR101
BR101.OffgasCO2.PV BR101   OffgasCO2    %     http://qudt.org/vocab/unit/PERCENT newark/upstream/BR101/OffgasCO2 spBv1.0/newark/DDATA/edge1/BR101
 BR101.OffgasO2.PV BR101    OffgasO2    %     http://qudt.org/vocab/unit/PERCENT  newark/upstream/BR101/OffgasO2 spBv1.0/newark/DDATA/edge1/BR101

16 tags; lint problems: 0

All sixteen of the fed-batch tags pass. More importantly, the repo's test suite asserts both halves of the contract — that the real tags pass and that the linter actually rejects garbage. From examples/tests/test_chapters.py:

examples/tests/test_chapters.py
def test_ch04_naming_linter_passes_on_real_tags():
    import naming

    dictionary, problems = naming.lint_dataset()
    assert len(dictionary) == 16
    assert problems == []                       # every simulator tag obeys the convention
    assert naming.lint_tag("badtag") is not None  # and the linter actually rejects bad ones
    assert naming.uns_path("BR101.Temp.PV") == "newark/upstream/BR101/Temp"

To see why the gate matters, here is lint_tag() run on the kinds of drift that actually show up in plants:

'BR101.Temp.PV'    -> None
'TT101'            -> 'TT101' does not match <UNIT>.<Measurement>.<Attr>
'br101.temp.pv'    -> 'br101.temp.pv' does not match <UNIT>.<Measurement>.<Attr>
'TK205.Temp.PV'    -> unit 'TK205' is not in the equipment hierarchy
'BR101.Temp.Value' -> 'BR101.Temp.Value' does not match <UNIT>.<Measurement>.<Attr>

The raw P&ID tag (TT101), the lowercase typo (br101.temp.pv), the well-formed-but-unknown vessel (TK205), and the synonym creep (Value instead of PV) are each caught with a specific reason. In the companion repo this contract is wired into the test suite — make test runs test_ch04_naming_linter_passes_on_real_tags, which fails if any real tag breaks the convention — so naming is not a wiki page everyone ignores but an assertion that goes red. Promoting that same check to a pre-commit hook or a CI (continuous-integration — the automated build-and-test pipeline that runs on every change) gate in your own deployment is the next step, and the one a GxP (Good Practice — the umbrella for GMP/GLP/GCP regulated activities) SOP (Standard Operating Procedure — a controlled, approved written procedure) will ultimately require.
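One way to promote the check to a pre-commit or CI gate is a function whose return value becomes the process exit code. The sketch below repeats TAG_RE, UNIT_AREA, and lint_tag from naming.py so that it runs on its own; the gate function and the way tags reach it are assumptions for illustration, not the repo's wiring.

```python
import re
import sys
from typing import Iterable, Optional

# Copied from naming.py so the sketch is self-contained.
TAG_RE = re.compile(r"^[A-Z][A-Z0-9]+\.[A-Za-z][A-Za-z0-9]+\.(PV|SP|MV|Rate|State)$")
UNIT_AREA = {"BR101": ("newark", "upstream"), "N1SEED": ("newark", "upstream"),
             "PA01": ("newark", "downstream"), "PBR201": ("newark", "upstream")}


def lint_tag(tag: str) -> Optional[str]:
    """Return None if the tag is valid, else a reason string."""
    if not TAG_RE.match(tag):
        return f"'{tag}' does not match <UNIT>.<Measurement>.<Attr>"
    unit = tag.split(".")[0]
    if unit not in UNIT_AREA:
        return f"unit '{unit}' is not in the equipment hierarchy"
    return None


def gate(tags: Iterable[str]) -> int:
    """Hypothetical gate: report every bad tag, return a shell-style exit code."""
    problems = [(t, reason) for t in tags if (reason := lint_tag(t))]
    for tag, reason in problems:
        print(f"FAIL {tag}: {reason}", file=sys.stderr)
    return 1 if problems else 0


print(gate(["BR101.Temp.PV", "BR101.DO.SP"]))
print(gate(["BR101.Temp.PV", "TK205.Temp.PV", "BR101.Temp.Value"]))
```

In a real hook the return value would be handed to sys.exit(), so a single bad tag turns the commit or pipeline red while the message names the offender and the reason.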

Why it matters

Naming is the first place ALCOA+ either succeeds or quietly fails. If you cannot point at a number and say which instrument produced it, in what units, on what equipment, in which batch, you cannot make it attributable or traceable to its source — and both the UK MHRA (Medicines and Healthcare products Regulatory Agency) and the US FDA data-integrity guidance treat that traceability as foundational [2][3]. The lowercase typo and the Value-for-PV synonym the linter rejects are exactly attributability failures in waiting, and the well-formed-but-unknown vessel (TK205) is the subtler trap that a regex-only check would miss — which is why the linter's second test, membership in the equipment hierarchy, exists at all. Every later capability in this book — the historian's joins, the chromatography phase events, the knowledge graph's lineage queries, the audit trail's hash chain — assumes that the address on each packet of data is correct and unique. Get naming right and those layers compose. Get it wrong and you spend the rest of the program reconciling synonyms by hand during deviation investigations, which is precisely the failure mode the peer-reviewed Bioprocessing 4.0 literature (the bioprocessing-specific take on Industry 4.0 — the drive toward connected, data-rich, automated manufacturing) identifies as endemic across the MES, DCS, and SCADA systems a plant runs [1].

The gate's value is therefore not that it is clever but that it is unblinking and early. A bad tag rejected at commit time costs a one-line fix; the same tag discovered during a deviation investigation costs a documented data-integrity finding and a manual reconciliation across every downstream system that already filed the orphan. Moving the check left — from a human review that happens sometimes to a CI assertion that happens always — is the entire difference between a convention people respect and one they quietly route around.

Naming is the join key every model leans on

The furthest-downstream consumer of this dictionary is the analytics and machine-learning layer, and it depends on the naming discipline in a way that is easy to miss until it bites. A Raman soft sensor or an out-of-spec predictor is trained by joining the dense historian stream to the sparse offline assays — and that join is keyed on exactly the identities this chapter pins down: the canonical tag and the batch_id. Book 5 makes the dependency explicit: the single most damaging mistake in bioprocess ML is data leakage (a held-out reading sneaking into training so the reported score is fantasy), and the fix is a batch-grouped split — every row of a batch goes wholly to train or wholly to test, using scikit-learn's GroupKFold or a leave-one-batch-out scheme — which is only possible if batch_id is a stable, single-valued, governed identifier rather than three spellings of the same run [1]. A DissolvedO2-versus-DO synonym splits one feature column into two half-populated ones, and an ungoverned batch_id lets a batch straddle the train/test line invisibly. The naming gate is the unglamorous precondition for the leak-free split and the grouped cross-validation the ML data chapter builds on.
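The batch-grouped split depends on batch_id being a single, stable key, which a pure-Python leave-one-batch-out sketch makes visible. The function name and the sample rows are illustrative; BATCH-2026-001 appears in the golden data, while the other two batch ids and all values are made up for this example.

```python
from collections import defaultdict
from typing import Dict, Iterator, List, Tuple

Row = Dict[str, object]


def leave_one_batch_out(rows: List[Row], key: str = "batch_id"
                        ) -> Iterator[Tuple[str, List[Row], List[Row]]]:
    """Yield (held_out_batch, train_rows, test_rows), never splitting a batch."""
    by_batch = defaultdict(list)
    for row in rows:
        by_batch[row[key]].append(row)
    for held_out in sorted(by_batch):
        test = by_batch[held_out]
        train = [r for b, rs in by_batch.items() if b != held_out for r in rs]
        yield held_out, train, test


# Illustrative rows; only the first batch id comes from the golden dataset.
rows = [
    {"batch_id": "BATCH-2026-001", "tag": "BR101.DO.PV", "value": 40.8},
    {"batch_id": "BATCH-2026-001", "tag": "BR101.DO.PV", "value": 40.1},
    {"batch_id": "BATCH-2026-002", "tag": "BR101.DO.PV", "value": 39.7},
    {"batch_id": "BATCH-2026-002", "tag": "BR101.DO.PV", "value": 41.2},
    {"batch_id": "BATCH-2026-003", "tag": "BR101.DO.PV", "value": 40.4},
    {"batch_id": "BATCH-2026-003", "tag": "BR101.DO.PV", "value": 40.0},
]
for held_out, train, test in leave_one_batch_out(rows):
    print(held_out, len(train), len(test))
```

If one run were recorded under two spellings of its batch_id, this grouping would treat it as two batches and let its rows land on both sides of the split, which is exactly the leak the governed identifier prevents.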

Two further model concerns trace straight back to the columns of gov.tag_dictionary. The first is model lineage: a regulated model must record exactly which signals, in which units, it was trained on — and the canonical tag plus its qudt_unit IRI is that record, so a %sat-versus-% ambiguity resolved here is one that can never silently re-scale a model's input later. The second is drift, where naming draws a line worth keeping sharp. A model watches its inputs for data drift — the input distribution moving away from the training set, caught label-free by a statistic like the Population Stability Index — and the historian's quality code and deadband on each tag are part of what keeps that signal trustworthy; but process drift, the living culture itself wandering batch to batch, is a different thing measured on the same well-named signals [1]. The MLOps chapter separates the two, and both detectors assume the input they monitor has one stable name. The same quality-and-applicability-domain reflex shows up at inference: a model that declines to trust a reading flagged Uncertain is doing, in real time, what the linter does at commit time — refusing an input it cannot vouch for.
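The Population Stability Index mentioned above can be sketched in a few lines. This is a minimal illustration, assuming fixed shared bin edges and a small floor `eps` so that empty bins do not produce a log of zero; the function name `psi`, the bin edges, and the sample temperatures are hypothetical and not taken from the MLOps chapter. Values outside the edges are simply ignored in this sketch.

```python
import math


def psi(expected, actual, edges, eps=1e-6):
    """Population Stability Index of `actual` against `expected` over shared bins."""

    def fractions(values):
        counts = [0] * (len(edges) - 1)
        for v in values:
            for k in range(len(counts)):
                is_last = k == len(counts) - 1
                # Half-open bins, except the last bin which includes its right edge.
                if edges[k] <= v < edges[k + 1] or (is_last and v == edges[-1]):
                    counts[k] += 1
                    break
        total = sum(counts)
        # Floor each fraction at eps so an empty bin does not produce log(0).
        return [max(c / total, eps) for c in counts]

    e, a = fractions(expected), fractions(actual)
    # Each term is non-negative; the sum is zero only when the fractions match.
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


# Hypothetical BR101.Temp.PV readings in degrees Celsius.
edges = [36.5, 36.9, 37.1, 37.5]
training = [36.8, 36.9, 37.0, 37.0, 37.1, 37.2]
live_same = list(training)
live_shifted = [37.1, 37.2, 37.2, 37.3, 37.3, 37.4]

psi_same = psi(training, live_same, edges)
psi_shifted = psi(training, live_shifted, edges)
```

The statistic needs no labels, only two samples of the same input. That is also why it depends on naming: the training and live samples must be drawn from one signal under one tag, in one unit, or the comparison is between two different things.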

In the real world

The honest picture is that open source gets you most of the way here, and unusually cleanly. The convention is just discipline; the UNS is a topic tree on a free MQTT broker; the dictionary is a Postgres table; the linter is fifty lines of Python that are yours to copy and adapt. There is no commercial product you need to buy to name things well, which is why naming is one of the few layers where the pure-OSS answer is genuinely complete.

What OSS does not hand you is the surrounding governance, and this is the candid part. Commercial UNS and namespace tooling (HighByte Intelligence Hub, the Ignition platform's tag system, AVEVA's asset framework) ships with graphical modelers, role-based change control, and bundled aliasing services that our gov.tag_dictionary + linter approach reproduces only if you wire the change control around it. The standards we lean on are real and current: ISA-95 Part 1 is in its 2025 edition (ANSI/ISA-95.00.01-2025) [4], Part 7's Alias Service Model is the sanctioned way to reconcile names across systems [11], ISA-5.1 still governs the loop tags on the P&ID [10], and Sparkplug 3.0 and MQTT are the de facto UNS transport [9][7]. But a standard is a contract, not an enforcer. The enforcer is your CI gate and, in a GxP context, the validated SOP that says no tag enters production without passing it: the kind of computerized-system control 21 CFR Part 11 and EU Annex 11 expect for electronic records. We name our simulated site newark as a nod to the greenfield case, where settling one namespace before the first probe is wired is the difference between a coherent data platform and a decade of reconciliation.

The harder case is the brownfield one, and it is where naming meets tech transfer — moving a validated process from a development lab or a sending site to a receiving manufacturing site. A process characterized on a 15 mL Ambr or a bench reactor and scaled to a 2,000 L stirred tank does not just change vessel size; it changes which instruments exist and what they are called, and the receiving site almost never shares the sending site's tag scheme. This is precisely why the canonical-name-plus-alias design matters beyond one plant: the site and area segments of the UNS path let two sites — a boston/development skid and our newark/upstream/BR101 train — carry the same measurement under site-distinct addresses, while the ISA-95 Part 7 alias table reconciles the sending site's TT-101 with the receiving site's canonical tag without renaming either. And because the qualification of a control or data system at the receiving site — the IQ/OQ/PQ sequence (Installation, Operational, and Performance Qualification: documented evidence that a system is installed right, runs right, and performs right in routine use) — must verify that every signal is captured under its approved name with its approved units, a frozen tag dictionary is exactly the as-built specification those protocols test against. The downstream half of the train carries the same burden: the Protein A capture skid PA01, whose own signals (PA01.UV280.PV in mAU, PA01.Phase.State) ride this convention, is a single-use or self-packed unit operation whose tags must transfer site-to-site as cleanly as the bioreactor's — naming is what lets a capture chromatography step prove it is the same operation at scale.

One more candid note for production: our linter checks structure and equipment membership, not semantic duplication. It will happily accept both BR101.DO.PV and a hypothetical BR101.DissolvedOxygen.PV if someone adds the second measurement name to the hierarchy. Catching that — true synonym detection — needs the Part 7 alias table populated and reviewed by a human, which is governance, not regex.
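The gap can be shown in a short sketch. Everything here is a hypothetical stand-in: the regex, the attribute set (`PV`, `SP`, `State`), the `HIERARCHY` dict, and the `MEASUREMENT_ALIASES` table are assumptions made for illustration, not the repo's naming.py or its real Part 7 alias table. The point is only that a structure-plus-membership check passes both names, while a human-reviewed alias mapping is what exposes them as one signal.

```python
import re

# Hypothetical stand-in for the convention <UNIT>.<Measurement>.<Attr>.
TAG_RE = re.compile(
    r"^(?P<unit>[A-Z]{2}\d{2,3})\.(?P<meas>[A-Za-z0-9]+)\.(?P<attr>PV|SP|State)$"
)

# Hypothetical hierarchy; someone has added a second name for dissolved oxygen.
HIERARCHY = {"BR101": {"Temp", "DO", "DissolvedOxygen"}}

# Hypothetical reviewed alias table: synonym -> canonical measurement name.
MEASUREMENT_ALIASES = {"DissolvedOxygen": "DO"}


def lint(tag):
    """Structure check, then membership in the equipment hierarchy."""
    m = TAG_RE.match(tag)
    return bool(m) and m["meas"] in HIERARCHY.get(m["unit"], set())


def semantic_duplicates(tags):
    """Pair up tags that resolve to the same canonical signal via the alias table."""
    seen = {}
    dupes = []
    for tag in tags:
        unit, meas, attr = tag.split(".")
        key = (unit, MEASUREMENT_ALIASES.get(meas, meas), attr)
        if key in seen:
            dupes.append((seen[key], tag))
        else:
            seen[key] = tag
    return dupes


candidates = ["BR101.DO.PV", "BR101.DissolvedOxygen.PV"]
both_pass = all(lint(t) for t in candidates)
dupes = semantic_duplicates(candidates)
```

Both candidates pass `lint`, and only the alias lookup reports them as a pair. The code is trivial; the `MEASUREMENT_ALIASES` entry is the part that takes a human reviewer, which is why this belongs to governance rather than to the regex.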

Key terms

Tag — a named signal, here <UNIT>.<Measurement>.<Attr> (e.g. BR101.Temp.PV); the postal address of a data point.
Canonical name — the one official tag for a signal; the primary key of the dictionary and the system of record.
Unified Namespace (UNS) — a real-time, broker-agnostic hierarchy that is the single source of truth for current plant state, organized like the business and (here) shaped on ISA-95 levels [5].
MQTT topic — a slash-delimited publish/subscribe path; supports + (single-level) and # (multi-level) wildcards [7][8].
Sparkplug B — an opinionated MQTT profile fixing the topic to spBv1.0/<group>/<msg_type>/<edge>/<device> and adding birth/death state [9].
ISA-5.1 instrument tag — the P&ID loop label (e.g. TT-101, PIC) where the first letter is the measured variable [10].
ISA-95 Part 7 aliasing — the Alias Service Model that declares different system identifiers equivalent to one canonical object [11].
QUDT unit IRI — a global identifier for a unit of measure (e.g. .../unit/DEG_C) that pins a value's units machine-readably.
Tag dictionary — the governed register (gov.tag_dictionary) of every legal tag, its hierarchy placement, unit, UNS path, and Sparkplug topic.
Deadband — the gov.tag_dictionary column holding the change threshold below which a new reading is not worth publishing; storing it next to the tag keeps capture policy governed rather than scattered across collector configs.
Stored (vs. derived) column — uns_path and sparkplug_topic are computed once by build_dictionary() and persisted as columns, not recomputed at read time, so a later query can never re-route a signal to a topic governance did not approve.
Linter (naming.py) — the check that rejects any tag not matching the convention or not present in the equipment hierarchy; in the repo it runs via the test_ch04_naming_linter_passes_on_real_tags test under make test, and is the natural thing to promote to a pre-commit hook or CI gate in production.
SHACL shape — a declarative constraint (Shapes Constraint Language) that validates RDF triples the way the linter validates a tag string; here bp:TagShape mirrors TAG_RE's closed attribute set and the hierarchy-membership check into the knowledge graph.
Batch-grouped split — the ML validation discipline of keeping every row of a batch wholly in train or wholly in test (GroupKFold / leave-one-batch-out), which only works when batch_id is one governed identifier; the precondition for a leak-free model [1].
Data drift vs. process drift — data drift is the model's input distribution moving away from its training set (watched label-free); process drift is the living culture itself wandering batch to batch — two different things measured on the same well-named signals (see MLOps).
Tech transfer / IQ-OQ-PQ — moving a validated process to a receiving site and qualifying its systems (Installation/Operational/Performance Qualification); a frozen tag dictionary is the as-built naming specification those protocols test against, with the Part 7 alias table reconciling site-to-site labels.

Where this leads

Every signal now has a name the whole platform agrees on, a place in the ISA-95 tree, a UNS path to be browsed by, a Sparkplug topic to ride on, and a robot at the door that keeps it all honest. We have built the address book. But a name is a label on a box; the next chapter says what is inside it. A Canonical Bioreactor Information Model: Companion Specs, NodeSet2, and Semantic Alignment turns the named signals into a platform-agnostic, device-modular OPC UA model — typed against companion specifications, unit-bearing, and aligned to a shared ontology — so two vendors' servers can expose the same meaning, not just the same protocol, before Part II stands up the wire that carries it.
