The Edge Gateway: Routing Floor Data with Node-RED, Telegraf & NiFi

📍 Where we are: Part II, "Capturing the Process." The signal already leaves the bioreactor (the vessel where the cells grow) as OPC UA (the industrial protocol the sensors speak) and lands on the MQTT broker (the lightweight message hub that relays it) — all set up in the previous chapter. Now we build the gateway that collects, transforms, and routes it onward — and we decide which of three open-source tools does the job.

The simple version

Think of the edge gateway as the mailroom of the factory. Sensors all over the floor keep dropping envelopes (measurements). The mailroom sorts them, re-labels anything that arrived in a weird format, decides which ones go to which department, and — crucially — keeps a logbook of every envelope it touched. A sloppy mailroom loses mail. A regulated mailroom can tell you, months later, exactly which envelope it received, when, where it came from, and where it sent it. This chapter builds three kinds of mailroom and picks the right one for each job.

The edge gateway sits on the fault line of the whole platform: the seam between OT (operational technology: the controllers, skids, and sensors running the process) and IT (the databases, dashboards, and analytics that make sense of it). On the OT side, data speaks OPC UA and Modbus and lives on isolated control networks; on the IT side it speaks MQTT, SQL, and HTTP. Something has to stand in the middle, translate, buffer, and route, without ever touching the validated control loop. That loop is the closed sensor-to-controller-to-actuator cycle that actually runs the process: a sensor reads, a controller decides, and an actuator such as a pump or valve acts. It has been formally proven and documented to work correctly under GxP (the family of "good-practice" regulations governing regulated work), so it may not be altered without re-qualifying it (see Biologic Drug Manufacturing, Quality, Regulatory, and Data). The something in the middle is the gateway, and getting it right is the difference between data you can submit to a regulator and data you merely hope is complete.

What this chapter covers

Why the OT/IT bridge exists and what a gateway must do at that seam.
The three open-source tools we ship — Node-RED (low-code flows), Telegraf (declarative collection), and Apache NiFi (guaranteed delivery + replayable provenance) — and what each is genuinely good at.
How they route data from the broker into the historian (ts.sensor_reading) and the batch model, with the real long-format rows you'll see.
The honest part: delivery semantics (at-least-once vs. exactly-once), the audit-trail gap, and where the OSS gateway hands off to the validated record-of-truth.

The seam: what a gateway actually does

A gateway is not a database and not a dashboard. Its job, distilled, is a southbound–transform–northbound pipeline. Southbound, it speaks the field protocols — it subscribes to the bioreactor's OPC UA address space — the structured tree of readable items an OPC UA server exposes — asking that server to push new readings as they change [1] and, for older skids (a skid is a self-contained, frame-mounted unit of process equipment), polls Modbus registers (Modbus is the older, simpler industrial protocol that exposes data as numbered memory slots called registers the gateway reads on a timer). In the middle it normalizes: a raw register value of 3725 becomes 37.25 degC (many devices send integers with the decimal point implied — here a fixed factor of 100 — so the gateway applies the scale), a vendor's tag (the device's own name for a signal) becomes our canonical BR101.Temp.PV (read asset.measurement.role: bioreactor BR101, its temperature, the process value — the current measured reading, as opposed to a setpoint), and a missing unit is filled in. Northbound, it routes the cleaned record to wherever it needs to live — onto the MQTT broker as a Unified Namespace topic (the single canonical real-time hierarchy of topics we build in Naming Things), or straight into TimescaleDB (PostgreSQL with a time-series extension; the historian database).
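The normalize step described above can be sketched in a few lines of Python. This is a minimal illustration, not the gateway's actual code: the vendor tag TT_101_PV and the three lookup tables are hypothetical names invented for the example, while the scale factor of 100 and the canonical tag BR101.Temp.PV come from the text.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical lookup tables: the vendor tag name is invented for illustration.
TAG_MAP = {"TT_101_PV": "BR101.Temp.PV"}      # vendor tag -> canonical tag
SCALE = {"BR101.Temp.PV": 100}                # register carries hundredths
DEFAULT_UNIT = {"BR101.Temp.PV": "degC"}      # filled in when the source omits it


@dataclass(frozen=True)
class Reading:
    tag: str
    value: float
    unit: str
    quality: int


def normalize(vendor_tag: str, raw: int, quality: int,
              unit: Optional[str] = None) -> Reading:
    """Rename the tag, apply the implied decimal point, fill a missing unit.

    The quality flag is passed through unchanged.
    """
    tag = TAG_MAP[vendor_tag]
    value = raw / SCALE.get(tag, 1)
    return Reading(tag, value, unit or DEFAULT_UNIT[tag], quality)


reading = normalize("TT_101_PV", 3725, quality=192)
print(reading)
```

Keeping the scale and unit in tables rather than in code means onboarding a new signal is a data change, not a logic change.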

A peer-reviewed reference design for exactly this shape — a modular edge gateway with a southbound protocol-translation layer for Modbus/MQTT/OPC UA and a cache that decouples acquisition from transmission so a network hiccup never drops a sample — was published in 2026 [2]. That decoupling is the whole game. The floor never stops producing data; the network sometimes stops carrying it. A gateway that buffers locally and forwards when the link returns is the only kind that preserves a complete record. A second analysis of an OPC UA gateway on embedded hardware quantifies the other tension: OPC UA is a rich, heavy, self-describing stack, while MQTT is a light pub/sub transport (publish/subscribe — senders publish messages to named topics and receivers subscribe to the topics they care about, with no direct connection between them). "Heavy" means OPC UA carries a lot of structure and metadata per exchange and costs more CPU and bandwidth; "light" means MQTT moves small messages cheaply — so a gateway routinely reads heavy OPC UA southbound and emits light MQTT northbound [3].
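To make the decoupling concrete, here is a toy store-and-forward buffer in Python. It is a sketch under stated assumptions, not the reference design from [2]: the class name StoreAndForward and the send callback are hypothetical, and the buffer lives in memory, where a real gateway would persist it to disk so it survives a restart.

```python
from collections import deque
from typing import Callable, Deque, Dict, List


class StoreAndForward:
    """Acquisition appends to a local queue; transmission drains it when it can."""

    def __init__(self, send: Callable[[Dict], bool], max_items: int = 100_000) -> None:
        self._send = send                  # returns True only if the sample was accepted
        self._pending: Deque[Dict] = deque()
        self._max_items = max_items

    def __len__(self) -> int:
        return len(self._pending)

    def acquire(self, sample: Dict) -> None:
        if len(self._pending) >= self._max_items:
            # Fail loudly rather than silently discarding the oldest sample.
            raise OverflowError("buffer full: refusing to drop data silently")
        self._pending.append(sample)

    def flush(self) -> int:
        """Send in arrival order; stop at the first failure and keep the rest."""
        sent = 0
        while self._pending:
            if not self._send(self._pending[0]):
                break
            self._pending.popleft()        # remove only after a confirmed send
            sent += 1
        return sent


link = {"up": False}                       # simulated northbound link state
delivered: List[Dict] = []


def send(sample: Dict) -> bool:
    if link["up"]:
        delivered.append(sample)
    return link["up"]


gw = StoreAndForward(send)
for seq in range(3):
    gw.acquire({"seq": seq})
    gw.flush()                             # link is down: nothing leaves, nothing is lost
link["up"] = True
gw.flush()                                 # link returns: the backlog drains in order
print([s["seq"] for s in delivered])
```

The one design choice that matters is the order of operations in flush: a sample leaves the queue only after the send is confirmed, never before.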

A bioprocess edge gateway bridging the OT control network and the IT data network, showing southbound OPC UA and Modbus inputs, an in-gateway transform and buffer stage, and northbound MQTT and TimescaleDB outputs, with the three open-source tools mapped to the stage each is best at.

The edge gateway as the OT/IT seam: it reads field protocols southbound, normalizes and buffers in the middle, and routes northbound to the broker and historian. Node-RED, Telegraf, and NiFi each occupy a different point on the same pipeline. Original diagram by the authors, created with AI assistance.

Southbound, transform, northbound

It is worth naming the three stages precisely, because the failure modes differ at each and the tool choice later in the chapter maps directly onto them.

Southbound is the field side. The gateway is a client: it subscribes to the bioreactor's OPC UA address space [1] or polls Modbus registers, and it reads value and StatusCode together — never the value alone. (The StatusCode, or quality flag, is the field's own verdict on whether the reading is trustworthy — Good, Uncertain, or Bad; the codes are unpacked under "Where the data lands.") The cardinal sin here is reading the number and dropping the status, because then every value looks equally trustworthy. Southbound is also where the acquisition/transmission decoupling lives: a local buffer that keeps acquiring while the northbound link is down is what makes the record complete rather than merely recent [2].
Transform is the middle. A raw register 3725 becomes 37.25 degC; the vendor tag becomes the canonical BR101.Temp.PV; a missing unit is filled; the quality flag is carried through untouched. Because OPC UA is a heavy, self-describing stack and MQTT is a light transport, the transform is also where a rich southbound payload is thinned to a compact northbound message [3]. This stage is also, as the next section shows, exactly where in-process loss hides.
Northbound is the IT side. The cleaned record is routed onward — onto the broker as a Unified Namespace topic, or straight into TimescaleDB as a ts.sensor_reading row. Northbound is where delivery-semantics (the QoS choice) and idempotency (de-dup on (ts, tag)) matter, because this is the hop that crosses the network into the system of record.

Three tools, three jobs

We ship all three because no single tool wins on every axis. The trick is matching the tool to the job.

Node-RED — the low-code flow editor

Node-RED is a browser-based, low-code editor where you wire small functional nodes into a flow; the flow itself is stored as JSON and runs on Node.js [4]. It is the fastest way to get an idea onto the floor: drag an mqtt in node, a function node to reshape the payload, and a postgresql node, connect them, and you are ingesting. Process engineers who would never write a daemon will happily build a Node-RED flow.

Because flows are JSON, they live in Git like any other config-as-code artifact. A minimal flow takes a Sparkplug payload off the broker, picks out one metric, and inserts a row into the historian. (The payload is the data body of an MQTT message; Sparkplug B is the spBv1.0/... payload specification for industrial data that standardizes how each measurement, its timestamp, and its quality are packed into that body. It is protobuf-encoded on the wire, so the function node below assumes it has already been decoded into an object.) This is a realistic Node-RED flow export, not a tested artifact in the companion repo; a gateway like it would run behind an opt-in profile rather than the core profile:

// edge/node-red/flows.json  (realistic config — capture profile, not in the core compose)
[
  { "id": "mqtt-in", "type": "mqtt in", "topic": "spBv1.0/newark/DDATA/edge1/BR101",
    "qos": "1", "broker": "mosquitto", "wires": [["to-row"]] },
  { "id": "to-row", "type": "function", "name": "sparkplug -> sensor_reading",
    "func": "const m = msg.payload.metrics[0];\nconst row = {\n  ts: new Date(m.timestamp).toISOString(),\n  tag: m.name, value: m.value, unit: m.properties.unit,\n  quality: m.is_null ? 0 : 192, batch_id: flow.get('batch_id')\n};\nmsg.payload = row;\n// the postgresql node binds $1..$6 from msg.params, in this order\nmsg.params = [row.ts, row.tag, row.value, row.unit, row.quality, row.batch_id];\nreturn msg;",
    "wires": [["to-pg"]] },
  { "id": "to-pg", "type": "postgresql", "name": "ts.sensor_reading",
    "query": "INSERT INTO ts.sensor_reading (ts,tag,value,unit,quality,batch_id) VALUES ($1,$2,$3,$4,$5,$6)" }
]

(The quality: m.is_null ? 0 : 192 line uses the historian's legacy OPC DA convention — 192 Good, 0 Bad — unpacked in full under "Where the data lands.")

Node-RED's strength is also its limit. It ships only basic authentication and a thin permission model, so in a GxP setting (the good-practice family introduced above — GMP for manufacturing, GLP for labs, GCP for clinical) you cannot prove who changed a flow or grant fine-grained roles (role-based access control, RBAC — permissions tied to a user's job role) without an enterprise add-on. We treat it as the prototyping and light-glue layer, and we say so out loud.

Telegraf — declarative collection

Where Node-RED is interactive, Telegraf is the opposite: a single Go binary configured entirely by a TOML file, with a plugin model — inputs, processors, aggregators, outputs — that you compose declaratively [5]. There is no canvas and no clicking. You write the config, version it, and the agent does exactly what the file says, every time. That determinism is precisely what you want for steady, high-rate metric collection.

A Telegraf config that consumes the broker's UNS topics and writes straight to PostgreSQL/TimescaleDB is short — again, a realistic capture-profile artifact, labelled as such:

# edge/telegraf/telegraf.conf  (realistic config — capture profile)
[agent]
  interval = "5s"
  flush_interval = "5s"
  omit_hostname = true

[[inputs.mqtt_consumer]]
  servers = ["tcp://mosquitto:1883"]
  topics  = ["newark/+/+/+"]   # UNS path: site/area/asset/measurement (Ch.5); + matches one level
  data_format = "json"
  json_time_key = "ts"
  json_time_format = "2006-01-02T15:04:05Z07:00"   # Go's reference-time layout, not a literal date: it just spells out the ISO-8601 timestamp shape

[[outputs.postgresql]]
  connection = "host=postgres user=bioproc password=bioproc dbname=bioproc"
  table_template = "INSERT INTO ts.sensor_reading (ts,tag,value,unit,quality,batch_id) VALUES (...)"

The cost of that simplicity is that Telegraf collects and forwards; it does not give you a per-message audit trail. It will faithfully drop a message it cannot parse and move on. For monitoring stack health (we reuse it for exactly that in the operations chapters) it is ideal. For a regulated batch record it is a collector, not a system of record.

Apache NiFi — guaranteed delivery and replayable provenance

The third tool is the one that earns its keep when the data is regulated. Apache NiFi routes data as FlowFiles through a directed graph of processors, and for every FlowFile it creates, forks, clones, modifies, or sends, it writes a provenance event into a repository you can query and replay [6]. That is the closest any open-source edge tool comes to an end-to-end data-flow audit trail. You can ask NiFi, after the fact, "show me the lineage of this record" and it will reconstruct who/what/when/from-what — the same shape the W3C PROV ontology (a shared, machine-readable vocabulary that pins down a few core kinds of thing and how they relate) defines as entities, activities, and agents [7]. When the record must be defensible months later, that chain of custody is the feature.

NiFi's lighter sibling, MiNiFi, pushes the same idea down to the source: a small-footprint agent designed explicitly for "generation of data provenance with full chain of custody of information" right at the device [8]. On a constrained edge box next to the skid, MiNiFi collects and stamps provenance, then hands off to a central NiFi.

The price is weight. NiFi is a JVM (Java Virtual Machine) application (Java 21, ~2 GB of RAM) with its own provenance and content repositories, which is why a gateway like it belongs behind an opt-in profile rather than the always-on core stack — these gateway configs are illustrative, and the companion repo's capture profile ships a lighter OPC UA server-plus-collector mirror in their place. Its provenance is configured in nifi.properties, and the relevant lines — realistic configuration, not a core-profile artifact — are:

# edge/nifi/nifi.properties  (realistic config — capture profile, provenance enabled)
nifi.provenance.repository.implementation=org.apache.nifi.provenance.WriteAheadProvenanceRepository
nifi.provenance.repository.directory.default=./provenance_repository
nifi.provenance.repository.max.storage.time=180 days
nifi.provenance.repository.indexed.fields=EventType, FlowFileUUID, Filename, ProcessorID

Set max.storage.time to your retention requirement and NiFi keeps a replayable record of every FlowFile it handled for that window.

Where the data lands

Whichever tool routes it, the data converges on the same target: the historian hypertable ts.sensor_reading (a hypertable is TimescaleDB's automatically time-partitioned table — one logical table, transparently split into time chunks for fast time-series queries), defined exactly once in the shared platform schema and joined to the batch model. The destination database itself comes up with the always-on core profile. From examples/platform/compose/compose.yaml:

# examples/platform/compose/compose.yaml
services:
  postgres:
    # timescale/timescaledb IS PostgreSQL + TimescaleDB, so the historian
    # hypertable and the ISA-88/95 batch model live in one joinable database.
    image: timescale/timescaledb:2.17.2-pg17
    profiles: ["core"]
    environment:
      POSTGRES_USER: ${POSTGRES_USER:-bioproc}
      POSTGRES_PASSWORD: ${POSTGRES_PASSWORD:-bioproc}
      POSTGRES_DB: ${POSTGRES_DB:-bioproc}
    ports: ["5432:5432"]
    volumes:
      - pgdata:/var/lib/postgresql/data
      - ../db:/docker-entrypoint-initdb.d:ro   # 00-60 schema files run on first init

  mosquitto:
    image: eclipse-mosquitto:2.0.22
    profiles: ["core"]
    ports: ["1883:1883"]

Note the deliberate choice in the comment: TimescaleDB is PostgreSQL, so the high-rate sensor history and the ISA-88/95 batch context (ISA-88 models the batch/recipe procedure, ISA-95 the equipment hierarchy — the recipe-and-equipment model built in The Batch & Equipment Data Model) live in one database you can join. The gateway's only job is to land a clean row; the meaning comes from the join, which later chapters build.

The rows the gateway produces are dead simple — long format, one measurement per row. Here is the real shape, taken from examples/datasets/fedbatch_timeseries_10min.sample.csv:

ts,tag,value,unit,quality,batch_id
2026-01-05 00:00:00+00:00,BR101.Agitation.PV,81.4323,rpm,192,BATCH-2026-001
2026-01-05 00:00:00+00:00,BR101.Agitation.SP,81.6008,rpm,192,BATCH-2026-001
2026-01-05 00:00:00+00:00,BR101.DO.PV,40.8224,%sat,192,BATCH-2026-001

That quality column is not decoration. 192 is the legacy OPC DA (Classic) Good code (0xC0, hexadecimal for 192) the field side passes through. OPC DA (Classic) is the older, pre-OPC-UA generation of the OPC standard; many installed devices still emit its status codes. Watch the collision: recall from the previous chapter (Chapter 7, Connectivity: OPC UA and MQTT) that OPC UA-native quality is a different StatusCode whose Good is 0 — so 0 means Good in OPC UA but Bad in OPC DA. Our historian stores the OPC DA convention (192 Good), and the gateway carries the quality flag through untouched so that a dashboard or a reviewer can later distinguish a real reading from an uncertain or bad one. Throwing that field away at the edge is a classic data-integrity own-goal — you have silently made every value look equally trustworthy.
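The collision is easiest to see in code. The sketch below assumes a source that hands the gateway a native OPC UA StatusCode, in which case the value has to be translated into the historian's convention rather than stored raw. It relies on one fact about OPC UA: the top two bits of the 32-bit StatusCode carry the severity (00 Good, 01 Uncertain, 10 Bad). The function name is hypothetical, and treating the reserved 11 pattern as Bad is a conservative assumption of this example.

```python
GOOD, UNCERTAIN, BAD = 192, 64, 0   # the historian's legacy OPC DA convention


def ua_status_to_da_quality(status_code: int) -> int:
    """Map an OPC UA StatusCode to the 192/64/0 quality the historian stores."""
    severity = (status_code >> 30) & 0b11
    if severity == 0b00:
        return GOOD
    if severity == 0b01:
        return UNCERTAIN
    return BAD                      # 0b10 is Bad; reserved 0b11 is treated as Bad


print(ua_status_to_da_quality(0x00000000))   # OPC UA Good
print(ua_status_to_da_quality(0x40000000))   # OPC UA Uncertain
print(ua_status_to_da_quality(0x80000000))   # OPC UA Bad
```

Storing the raw OPC UA 0 in this column would record a Good reading as Bad, which is why the translation belongs in the transform stage and nowhere else.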

Anatomy of a sensor_reading row: the gateway's output contract

That CSV line is the gateway's entire deliverable, distilled to a single row. Whatever tool routes it, the contract it must satisfy is fixed by the historian schema, defined exactly once in examples/platform/db/20-historian.sql:

-- examples/platform/db/20-historian.sql
CREATE TABLE ts.sensor_reading (
    ts       timestamptz      NOT NULL,
    tag      text             NOT NULL,
    value    double precision,
    unit     text,
    quality  smallint         NOT NULL DEFAULT 192,  -- legacy OPC DA: 192 Good, 64 Uncertain, 0 Bad
    batch_id text
);

Six columns, and every one of them earns its place. Read field by field, this is the smallest unit of evidence the platform trusts:

ts (timestamptz NOT NULL) — the contemporaneous source timestamp, the "C" (Contemporaneous) and the "O" (Original) in ALCOA+ — the data-integrity checklist (Attributable, Legible, Contemporaneous, Original, Accurate, plus Complete, Consistent, Enduring, and Available). It is the time the value was measured, stamped by the OPC UA server or the broker, not the time it happened to be inserted. This is also the partition key of the TimescaleDB hypertable, so it is never optional.
tag (text NOT NULL) — the signal's canonical identity, e.g. BR101.DO.PV. The gateway's transform stage rewrites the vendor's tag into this asset.measurement.role form so two sources of the same signal converge on one name. .PV is the process value (the measured evidence); .SP is the recipe setpoint — the target value the batch recipe (the defined, parameterized production procedure, in the ISA-88 sense) commands the controller to hold — the same measurement, a different role, and the CSV above shows both as separate rows.
value (double precision, nullable) — the number itself, and nothing else. The BR101.DO.PV reading of 40.8224 above, for instance, is dissolved oxygen held near a 40 %sat (percent of air saturation) setpoint — a standard CHO (Chinese Hamster Ovary, the dominant mAb — monoclonal antibody, the protein drug this whole process makes — production cell line) target that keeps the culture aerobic without oxygen toxicity — not an opaque float. It is nullable on purpose: a sensor that drops out produces a row with a null value and a non-Good quality, which is more honest than no row at all.
unit (text, nullable) — what 40.8224 means. %sat, rpm, kg, degC — a bare number is not a measurement. The gateway fills a missing unit during normalization so the historian never stores an ambiguous value.
quality (smallint NOT NULL DEFAULT 192) — the trust flag, and the most important field in the row precisely because it is the one most often discarded. The schema's own comment maps it: 192 Good, 64 Uncertain, 0 Bad — the legacy OPC DA (Classic) status codes the edge node passes through, as Chapter 7 established [3]. The NOT NULL DEFAULT 192 is a deliberate, slightly dangerous choice: it means a careless loader that omits quality gets "Good" for free, so the discipline of carrying the real status through the transform is on the gateway, not the database.
batch_id (text, nullable) — the join key. On its own a row is a number in time; with batch_id it becomes evidence about a specific GMP (Good Manufacturing Practice) batch, joinable to the ISA-88/95 model in the same database. It is nullable because a reading taken between batches (a CIP cycle, an idle skid) is still real and still worth keeping. It is also what stitches the upstream signal this chapter captures to the downstream purification record that gives a lot its identity: the same BATCH-2026-001 that produced these bioreactor rows carries forward through Protein A capture, low-pH viral inactivation, polishing, viral filtration, and UF/DF — the train Biologic Drug Manufacturing walks from Capture chromatography through UF/DF and the drug substance. The gateway never sees a chromatography column or a virus filter, but by stamping the same batch_id on the in-process stream it makes the bioreactor trend, the capture eluate titer, and the final release panel one joinable thread — which is exactly the genealogy the later contextualization and knowledge-graph chapters walk.

Notice what is absent: there is no surrogate primary key. A reading's identity is the (ts, tag) pair, which is exactly why the de-duplication strategy for QoS 1 (MQTT's at-least-once delivery, which can deliver a message twice; defined under "Delivery semantics" below) is "receive twice, keep once": an ON CONFLICT clause or a dedup on (ts, tag) rather than on some auto-increment id. ON CONFLICT only works against a unique index or constraint, and the DDL shown above declares none, so that route needs a unique index on (ts, tag) alongside the table. The long format (one measurement per row, rather than a wide column-per-sensor table) is the other deliberate decision: a new tag is just more rows, never a schema migration, so the gateway can onboard a sensor it has never seen without anyone touching the historian's DDL.
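"Receive twice, keep once" can be demonstrated end to end with Python's built-in sqlite3 module standing in for PostgreSQL. This is a sketch, not the repo's schema: the UNIQUE (ts, tag) constraint is an assumption added for the example, and the row is the BR101.DO.PV sample from the CSV above. The ON CONFLICT ... DO NOTHING clause reads the same in both databases.

```python
import sqlite3

conn = sqlite3.connect(":memory:")      # in-memory stand-in for the historian
conn.execute("""
    CREATE TABLE sensor_reading (
        ts       TEXT    NOT NULL,
        tag      TEXT    NOT NULL,
        value    REAL,
        unit     TEXT,
        quality  INTEGER NOT NULL DEFAULT 192,
        batch_id TEXT,
        UNIQUE (ts, tag)                -- assumed: the arbiter ON CONFLICT needs
    )
""")

INSERT = """
    INSERT INTO sensor_reading (ts, tag, value, unit, quality, batch_id)
    VALUES (?, ?, ?, ?, ?, ?)
    ON CONFLICT (ts, tag) DO NOTHING
"""
row = ("2026-01-05 00:00:00+00:00", "BR101.DO.PV", 40.8224, "%sat", 192,
       "BATCH-2026-001")

for _ in range(2):                      # QoS 1 delivered the same message twice
    conn.execute(INSERT, row)

count = conn.execute("SELECT COUNT(*) FROM sensor_reading").fetchone()[0]
print(count)                            # 1
```

On a TimescaleDB hypertable a unique index must include the partitioning column, which (ts, tag) does, so the same constraint is available there.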

Identity card for one ts.sensor_reading row, labelling all six columns (ts, tag, value, unit, quality highlighted as the trust flag, and batch_id) with the long-format rationale.

One historian row, field by field: the gateway's output contract is six columns, and the quality flag is the one most often thrown away. Original diagram by the authors, created with AI assistance.

Where this row comes from

This six-column row is the open-source implementation of a story the other two books tell. In Biologic Drug Manufacturing, the same kind of trace is the data-logger record that follows a vial through cold-chain distribution: the physical event being measured. In Data Management in Biopharmaceutical Manufacturing, that measurement is treated as the data-point born at the source, with all its integrity stakes, in Instruments and Sensors. What was a physical reading in the first book, and a labelled data-point in the second, is here a concrete ts.sensor_reading row the gateway must land complete and quality-flagged.

In the companion repo, the chapters from 5 through 13 build this ingest path up piece by piece. So that the later contextualization and ALCOA+ chapters have something to query right away, the repo also ships one script that does the whole load at once — examples/tools/load_datasets.py. Its time-series loader is a textbook bulk ingest, exactly what a production gateway batches up northbound:

# examples/tools/load_datasets.py
def load_timeseries(conn) -> int:
    df = pd.read_parquet(DATA / "fedbatch_timeseries.parquet")
    buf = io.StringIO()
    df[["ts", "tag", "value", "unit", "quality", "batch_id"]].to_csv(buf, index=False, header=False)
    buf.seek(0)
    with conn.cursor() as cur:
        cur.execute("TRUNCATE ts.sensor_reading")
        with cur.copy("COPY ts.sensor_reading (ts, tag, value, unit, quality, batch_id) "
                      "FROM STDIN WITH (FORMAT csv)") as copy:
            copy.write(buf.read())
    return len(df)

The same script also routes the offline lab data into a different schema, and that detail matters for the gateway story. Before it writes, it stamps an attributable actor with set_config('app.user', 'loader', ...), so the database's audit trigger (a trigger is a small routine the database runs automatically on every insert or update; this one writes a who-and-what entry to the audit trail) records who introduced the row. The offline lab path can name an actor this way (here the service name 'loader'; the same setting can carry a human user) and so is attributable, whereas the high-rate gateway sensor stream is not, which is exactly why the regulated audit trail attaches to the lab and batch tables, never to the gateway:

# examples/tools/load_datasets.py
def load_offline(conn) -> int:
    df = pd.read_csv(DATA / "offline_assays.csv", parse_dates=["sample_time"])
    n = 0
    with conn.cursor() as cur:
        cur.execute("SELECT set_config('app.user', 'loader', false)")
        for _, r in df.iterrows():
            cur.execute(
                "INSERT INTO lab.sample (sample_id, batch_id, sample_time, sample_point, sample_type) "
                "VALUES (%s,%s,%s,%s,'in_process') ON CONFLICT (sample_id) DO NOTHING",
                (r.sample_id, r.batch_id, r.sample_time.to_pydatetime(), r.sample_point))
            ...
    return n

You run the whole thing with one make target the book prints verbatim:

make load     # load the datasets into the running stack (historian + lab + genealogy)
# -> loaded: 322560 sensor readings, 1344 offline results, 66 release results, 30 genealogy edges

The honest part: delivery semantics and the audit gap

Here is where we stop selling and start confessing. A gateway's most important promise is completeness: that every measurement the floor produced actually arrived. That is the Complete among the "+" additions of ALCOA+ (not the original "C", which is Contemporaneous), and the MHRA's (the UK medicines regulator) data-integrity guidance names it explicitly: data must be complete, with nothing silently dropped [9]. The FDA's CGMP (current Good Manufacturing Practice, the US GMP regulations) data-integrity Q&A makes the same demand from the other direction: all CGMP data must be complete, reliable, and accurate [10].

Delivery semantics: QoS and where loss actually happens

Completeness at the edge comes down to MQTT Quality of Service. The MQTT specification defines three levels: QoS 0 (at-most-once — fire and forget, messages can be lost), QoS 1 (at-least-once — guaranteed delivery but possible duplicates), and QoS 2 (exactly-once — guaranteed and de-duplicated, at the cost of a four-step handshake) [11]. The Node-RED flow above sets "qos": "1" on purpose: in bioprocess data you would rather receive a measurement twice (and de-duplicate on (ts, tag)) than lose it once. Choosing QoS 0 to save bandwidth is, in a regulated context, choosing to make your record incomplete.
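The difference between the first two levels shows up in a toy model. The Python below is a scripted simulation, not an MQTT implementation: the deliver function and its fault labels are hypothetical, and each send attempt is told in advance whether the message or its acknowledgement gets lost on the link.

```python
from typing import Iterator, List


def deliver(messages: List[str], qos: int, faults: Iterator[str]) -> List[str]:
    """Return what the subscriber sees.

    `faults` scripts each send attempt as 'ok', 'drop_publish' (message lost)
    or 'drop_ack' (message arrives, acknowledgement lost).
    """
    received: List[str] = []
    for msg in messages:
        while True:
            fault = next(faults)
            if fault != "drop_publish":
                received.append(msg)      # the PUBLISH reached the subscriber
            if qos == 0 or fault == "ok":
                break                     # QoS 0 never retries; QoS 1 stops once acked
    return received


msgs = ["m1", "m2", "m3"]
qos0 = deliver(msgs, 0, iter(["ok", "drop_publish", "ok"]))
qos1 = deliver(msgs, 1, iter(["ok", "drop_publish", "ok", "drop_ack", "ok"]))
deduped = list(dict.fromkeys(qos1))       # stand-in for the (ts, tag) de-dup

print(qos0)      # ['m1', 'm3']
print(qos1)      # ['m1', 'm2', 'm3', 'm3']
print(deduped)   # ['m1', 'm2', 'm3']
```

QoS 0 loses m2 for good, while QoS 1 delivers everything and pays for it with a duplicate m3 that the downstream de-dup removes.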

But QoS only protects the transport, and it does so one hop at a time: publisher to broker, then broker to subscriber. It says nothing about what happens inside the gateway when a transform throws, when the disk fills, or when the process restarts mid-flow. That is the trap: trace one message from the floor to the historian and the gap is obvious. QoS guards the broker-to-gateway hop; everything after the gateway receives the message (the normalize step, the in-gateway buffer, the northbound insert) is unprotected by it. A message that arrives perfectly under QoS 1 and then dies in a transform exception is lost with a completely clean transport log.
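One way to make in-gateway loss visible is to give every failed message somewhere to go and then reconcile the counts. The sketch below is illustrative only: the message shape, the transform, and the BR101.pH.PV tag are hypothetical, and a real gateway would persist the dead-letter queue rather than hold it in a list.

```python
from typing import Dict, List, Tuple


def transform(msg: Dict) -> Dict:
    """Apply the scale factor; raises if the message is malformed."""
    return {"tag": msg["tag"], "value": msg["raw"] / msg["scale"]}


def process(batch: List[Dict]) -> Tuple[List[Dict], List[Tuple[Dict, str]]]:
    """Route each message to 'landed' or to a dead-letter list, never to nowhere."""
    landed: List[Dict] = []
    dead_letter: List[Tuple[Dict, str]] = []
    for msg in batch:
        try:
            landed.append(transform(msg))
        except (KeyError, TypeError, ZeroDivisionError) as exc:
            dead_letter.append((msg, repr(exc)))   # keep the message and the reason
    return landed, dead_letter


batch = [
    {"tag": "BR101.Temp.PV", "raw": 3725, "scale": 100},
    {"tag": "BR101.DO.PV", "raw": None, "scale": 100},   # sensor dropout
    {"tag": "BR101.pH.PV", "raw": 702},                  # scale missing
]
landed, dead_letter = process(batch)

# The completeness check: every message received is accounted for.
print(len(batch), len(landed), len(dead_letter))
```

The invariant worth alerting on is received == landed + dead-lettered; a transport log cannot provide it because the loss happens after the transport has done its job.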

A sequence diagram tracing one measurement across four lifelines — field/broker, gateway-in, transform-and-buffer, and historian — showing the QoS choice at publish, the hop QoS protects, the transform-and-buffer stage where loss actually happens, the NiFi provenance emit, and the northbound INSERT into ts.sensor_reading.

This is exactly where Node-RED and Telegraf fall short and NiFi shines: NiFi's FlowFile model is transactional and its provenance repository records the fate of every message, so a dropped or rerouted record is visible and replayable rather than silently gone [6]. For a record that may be audited, "we think it all got through" is not an answer; "here is the provenance event for every FlowFile and here are zero failures in the dead-letter relationship" is.

Anatomy of a provenance event (and why it is not an audit trail)

When NiFi creates, forks, clones, modifies, or sends a FlowFile, it writes one provenance event into the repository configured in nifi.properties. Each event answers the three questions the W3C PROV ontology defines — entity, activity, agent — plus the when that PROV-O records as a time property, and it is worth dissecting one, because its shape is almost an audit trail and the gap between "almost" and "is" is the whole compliance story. A single RECEIVE/ROUTE/SEND event records, in PROV-O terms [7]:

entity — the FlowFile: its UUID (FlowFileUUID, one of the indexed fields), filename, size, and the attributes attached at that moment. This is the thing the event is about — the message carrying our BR101.DO.PV reading.
activity — the EventType: RECEIVE, ROUTE, CLONE, SEND, DROP, and so on, plus the ProcessorID that performed it. This is what happened to the entity.
agent — the processor / component: which NiFi processor was responsible. Note this is a software agent, not a person.
time — the when, which PROV-O records as a time property (not a fourth class): the event timestamp, kept for max.storage.time (180 days in our config), with the lineage links back to the FlowFile this one was derived from.
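The four fields above map onto a small data structure, and the lineage question becomes a walk over parent links. This Python model is a deliberate simplification, not NiFi's API or its on-disk record format: the ProvenanceEvent class, the parent_uuid field, and the FlowFile identifiers are hypothetical, and the processor names are used only as labels.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ProvenanceEvent:
    flowfile_uuid: str                  # entity: the FlowFile the event is about
    event_type: str                     # activity: RECEIVE, CLONE, SEND, ...
    processor_id: str                   # agent: always software, never a person
    event_time: str                     # time: when the event was recorded
    parent_uuid: Optional[str] = None   # lineage link to the source FlowFile


def lineage(events: List[ProvenanceEvent], flowfile_uuid: str) -> List[ProvenanceEvent]:
    """Walk parent links back to the origin; return events oldest first."""
    chain: List[ProvenanceEvent] = []
    seen = set()
    current: Optional[str] = flowfile_uuid
    while current is not None and current not in seen:
        seen.add(current)               # guard against a malformed cyclic link
        own = sorted((e for e in events if e.flowfile_uuid == current),
                     key=lambda e: e.event_time)
        chain = own + chain             # ancestors go in front
        current = next((e.parent_uuid for e in own if e.parent_uuid), None)
    return chain


events = [
    ProvenanceEvent("ff-1", "RECEIVE", "ConsumeMQTT", "2026-01-05T00:00:00Z"),
    ProvenanceEvent("ff-2", "CLONE", "RouteOnAttribute", "2026-01-05T00:00:01Z", "ff-1"),
    ProvenanceEvent("ff-2", "SEND", "PutDatabaseRecord", "2026-01-05T00:00:02Z", "ff-1"),
]
trail = [e.event_type for e in lineage(events, "ff-2")]
print(trail)   # ['RECEIVE', 'CLONE', 'SEND']
```

Note what the structure has no slot for: a human user and a reason. That absence is structural, not a configuration option.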

For the formal PROV-O model itself — and how the same prov:Activity/prov:Entity pattern reconciles two conflicting source claims without an owl:sameAs over-merge — see Ontologies for Biopharmaceutical Manufacturing, Identifiers and Units and Maintenance: Publication and FAIR.

Written as RDF, that "almost-audit" mapping is not metaphor. It is four triples about one event plus a fifth that types the agent, and the gateway's own row is the entity they are about:

# the provenance event, as PROV-O triples (illustrative)
ex:event-7c3 a prov:Activity ;                 # the EventType (RECEIVE / ROUTE / SEND)
    prov:used        ex:flowfile-9af ;         # the FlowFile carrying the BR101.DO.PV reading
    prov:wasAssociatedWith ex:nifi-PutDatabaseRecord ;   # a SOFTWARE agent, not a person
    prov:endedAtTime "2026-01-05T00:00:00Z"^^xsd:dateTime .
ex:nifi-PutDatabaseRecord a prov:SoftwareAgent .         # the gap: never a prov:Person

That prov:SoftwareAgent typing is the whole compliance story in one triple: a Part 11 audit trail needs a prov:Person (the human who) and a reason (the why), and the provenance graph structurally has neither. The same modeling pattern lets you turn the gateway's output contract into a machine-checkable gate. The closed-world question "did every released-batch reading arrive with a non-Bad quality flag?" is exactly the shape a SHACL (Shapes Constraint Language — the closed-world validator that fails on a missing or out-of-range value, unlike an open-world OWL reasoner) shape expresses, the kind built in Ontologies for Biopharmaceutical Manufacturing, Validation: The Release Gate and SHACL:

# a SHACL shape over the gateway's row contract (illustrative)
ex:SensorReadingShape a sh:NodeShape ;
    sh:targetClass ex:SensorReading ;
    sh:property [ sh:path ex:quality ; sh:minCount 1 ;
                  sh:in ( 192 64 ) ;          # Good or Uncertain — a Bad/0 row fails the gate
                  sh:message "Reading is missing its quality flag or is flagged Bad." ] ;
    sh:property [ sh:path ex:ts ; sh:minCount 1 ; sh:datatype xsd:dateTime ] .
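
For readers who do not yet read SHACL, the same closed-world gate can be approximated in plain Python. This is an analogue for intuition, not a SHACL engine: the gate function is hypothetical, and rows are assumed to be dictionaries keyed by the historian's column names.

```python
from typing import Dict, List

ALLOWED_QUALITY = {192, 64}     # Good or Uncertain, mirroring sh:in ( 192 64 )


def gate(rows: List[Dict]) -> List[str]:
    """Return one failure message per violated constraint; empty means pass."""
    failures: List[str] = []
    for i, row in enumerate(rows):
        if row.get("ts") is None:                       # sh:minCount 1 on ex:ts
            failures.append(f"row {i}: missing ts")
        if row.get("quality") not in ALLOWED_QUALITY:   # absent or Bad both fail
            failures.append(
                f"row {i}: Reading is missing its quality flag or is flagged Bad.")
    return failures


rows = [
    {"ts": "2026-01-05T00:00:00Z", "tag": "BR101.DO.PV", "quality": 192},
    {"ts": "2026-01-05T00:10:00Z", "tag": "BR101.DO.PV", "quality": 0},
    {"ts": "2026-01-05T00:20:00Z", "tag": "BR101.DO.PV"},
]
failures = gate(rows)
print(failures)
```

The closed-world part is the third row: a missing quality flag is a failure, not an unknown to be reasoned around.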

And the lineage walk this batch_id makes possible — "trace every reading back to the cell bank it descends from" — is a SPARQL competency question over ex:derivedFrom, the genealogy spine Ontologies for Biopharmaceutical Manufacturing builds in Conceptualization: Relations and Genealogy and this book loads into a live graph in Semantics and the Digital Thread. The gateway emits the rows; the ontology layer is what lets a regulator ask them a question and get a provable answer.

Lay that next to the database's audit row, audit.change_log from examples/platform/db/50-alcoa.sql, and the missing fields jump out:

-- examples/platform/db/50-alcoa.sql (the regulated audit trail — for contrast)
CREATE TABLE audit.change_log (
    seq        bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    ts         timestamptz NOT NULL DEFAULT clock_timestamp(),
    db_user    text NOT NULL DEFAULT current_user,
    app_user   text,        -- WHO (set via SET app.user)
    table_name text NOT NULL,
    action     text NOT NULL,
    row_key    text,        -- which row (identity of the changed record)
    old_row    jsonb,       -- the value BEFORE
    new_row    jsonb,       -- the value AFTER
    reason     text,        -- WHY (set via SET app.reason)
    prev_hash  text,        -- tamper-evidence: chained to the previous row
    row_hash   text NOT NULL
);

A provenance event has entity, activity, software-agent, time. A Part 11 audit row (Part 11 = FDA 21 CFR Part 11, the US rule for trustworthy electronic records and signatures: an immutable audit trail of who changed what, when, and why) additionally has a human app_user (who), a reason (why), the old_row/new_row before-and-after of a regulated value being changed, and a prev_hash chain that makes tampering evident (each row stores a hash of the previous row, so altering any earlier record breaks every hash after it and the edit cannot pass unnoticed). The provenance event tells you the data flowed; it does not tell you a person changed a controlled value, why they changed it, or prove the record was not edited afterward. That is why we say plainly: provenance is a magnificent data-flow lineage and a real differentiator over a dumb collector, but it is not the regulated audit trail. The audit trail lives in the database, attached by trigger to lab.result, s88.batch, and s88.recipe_parameter — never to the gateway.

So even NiFi's provenance is not a Part 11 audit trail. It tells you what the data flow did; it does not, by itself, give you immutable, signed records of who changed a regulated value and why. That belongs to the database (the system-versioned history and hash chain we build in the trust chapters), not to the gateway. The honest division of labor is this: the gateway guarantees the data arrives and records how it flowed; the historian and batch model become the record of truth that is validated, audit-trailed, and signable. No edge tool is the compliant record on its own.
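To see why the prev_hash chain makes tampering evident, here is a minimal sketch of such a chain. It assumes SHA-256 over the previous hash concatenated with a canonical JSON payload; the actual hashing in 50-alcoa.sql may differ, and the helper names are hypothetical.

```python
import hashlib
import json


def row_hash(prev_hash: str, payload: dict) -> str:
    """Hash the previous row's hash together with this row's canonical payload."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append_row(log: list[dict], payload: dict) -> None:
    """Append an audit row chained to the current tail of the log."""
    prev = log[-1]["row_hash"] if log else ""
    log.append({"payload": payload, "prev_hash": prev,
                "row_hash": row_hash(prev, payload)})


def first_broken_row(log: list[dict]):
    """Return the index of the first row that fails verification, or None."""
    prev = ""
    for i, row in enumerate(log):
        if row["prev_hash"] != prev or row["row_hash"] != row_hash(prev, row["payload"]):
            return i
        prev = row["row_hash"]
    return None
```

Editing an early row's payload breaks that row's own hash. If the editor also recomputes that hash, the next row's prev_hash no longer matches, so the break simply moves one row later. Hiding the edit would mean rewriting every row after it.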

Why the model layer cares which tool you picked

This is a gateway chapter, so the modeling stakes are a premise, not a payload — but the premise is sharper than "models need good data." Three of the gateway's design choices reach all the way up into whether a learned model is valid, and naming the link is worth a paragraph because the failures are otherwise invisible until a soft sensor is already in production.

First, batch_id is the grouping key a model must split on, not just a join key. A soft sensor trained on this historian must be validated by holding out whole batches — scikit-learn's GroupKFold / LeaveOneGroupOut grouped on exactly the batch_id the gateway stamps — because two BR101.DO.PV rows an hour apart in the same run are near-duplicates, and scattering them across a train/test line lets the model memorize within-batch neighbours and report a fantasy R². If the gateway drops or mislabels batch_id, that leak-free split becomes impossible at the source. Machine Learning & AI for Biomanufacturing makes this the first lesson of the whole book — see The Learning Problem and the grouped, leakage-aware splitters in Models and Validation.
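The split itself is simple enough to sketch with the standard library. The generator below is a stand-in for scikit-learn's LeaveOneGroupOut, not a replacement for it; the function name and the dict-shaped rows are assumptions. It refuses to run if any row lacks a batch_id.

```python
from collections import defaultdict


def leave_one_batch_out(rows):
    """Yield (held_out_batch, train_idx, test_idx), holding out one whole batch per fold."""
    by_batch = defaultdict(list)
    for i, row in enumerate(rows):
        batch_id = row.get("batch_id")
        if batch_id is None:
            # Without the gateway's stamp there is no way to keep a run on one side.
            raise ValueError(f"row {i} has no batch_id; a leak-free split is impossible")
        by_batch[batch_id].append(i)
    for held_out in sorted(by_batch):
        test_idx = by_batch[held_out]
        train_idx = [i for b, idx in by_batch.items() if b != held_out for i in idx]
        yield held_out, train_idx, test_idx
```

Every fold keeps all rows of the held-out batch together in the test set, so no within-batch neighbour can appear on both sides of the line.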

Second, the quality flag is an applicability-domain signal, not decoration. A model is only trustworthy on inputs that resemble its training set; a row arriving quality = 0 (Bad) or 64 (Uncertain) is precisely an input the model should refuse rather than silently score. Carrying the flag through untouched at the edge is what lets a downstream sensor gate its own predictions — the Hotelling-T²/SPE applicability-domain check the Models and Validation chapter builds — instead of confidently extrapolating off a fouled probe. Discard the flag at the gateway and every reading looks equally in-domain, which is the same own-goal as the data-integrity one, one layer up.
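A minimal sketch of that refusal, assuming the legacy code 192 for Good and a hypothetical model passed in as a plain callable. It covers only the quality-flag part of the gate; the Hotelling-T²/SPE check is a separate, statistical test.

```python
GOOD = 192  # legacy OPC DA "Good"; 64 (Uncertain) and 0 (Bad) are refused


def gated_predict(row: dict, model) -> dict:
    """Score only Good-quality rows; refuse everything else and record why."""
    quality = row.get("quality")
    if quality != GOOD:
        # A missing flag is refused too: an unflagged row cannot be shown to be in-domain.
        return {"prediction": None, "refused": True, "reason": f"quality={quality}"}
    return {"prediction": model(row["value"]), "refused": False, "reason": None}
```

If the gateway had stripped the flag, every row would reach this function with quality None and be refused, or, worse, the gate would have to be removed to get any predictions at all.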

Third, drift detection lives or dies on the completeness this chapter defends. A population-stability-index monitor watching P(X) — the input distribution a model sees — can only distinguish genuine process drift (a new raw-material lot, a scale move) from data drift (a gateway silently dropping samples under QoS 0, or a transform exception thinning the stream) if the gateway's record is actually complete. An incomplete feed makes a healthy process look like it is drifting and a drifting one look healthy. The whole MLOps loop — drift triggers, locked-model retraining, rollback — sits on the assumption this gateway either earns or breaks; Machine Learning & AI for Biomanufacturing builds that loop in MLOps and Lifecycle. Infrastructure first is not a slogan here; it is the literal precondition for a model being validatable at all.
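A small sketch makes the confusion concrete. It uses the usual PSI form, the sum over bins of (actual share minus expected share) times the log of their ratio. The bin edges, the epsilon floor for empty bins, and the sample values are all illustrative assumptions.

```python
import math


def psi(expected, actual, edges, eps=1e-6):
    """Population stability index of `actual` against `expected` over fixed bin edges."""
    def shares(values):
        if not values:
            raise ValueError("cannot compute PSI on an empty sample")
        counts = [0] * (len(edges) + 1)
        for v in values:
            counts[sum(v >= e for e in edges)] += 1  # index of the bin holding v
        return [max(c / len(values), eps) for c in counts]  # floor avoids log(0)

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


EDGES = [40, 50, 60]
reference = list(range(30, 70))                 # the process as it really ran
thinned = [v for v in reference if v < 60]      # gateway lost every high reading

psi_same = psi(reference, reference, EDGES)
psi_thinned = psi(reference, thinned, EDGES)
```

Here the process never changed, yet psi_thinned is far from zero because the loss was concentrated in one range. Thinning that is uniform across the range (for example reference[::2]) leaves the index at zero, so it is selective loss, such as a transform exception on certain values or a dropout during one phase, that a monitor will misread as drift.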

Why it matters

Everything downstream (dashboards, contextualization, the knowledge graph, the soft sensor, the audit-trail review) assumes the data arrived complete and correctly labelled. The edge gateway is where that assumption is either earned or quietly broken. Drop the quality flag, pick QoS 0, or use a collector with no chain of custody, and you can build a beautiful platform on top of data you cannot defend. Choose the right tool for each job (Node-RED to prototype, Telegraf to collect at scale, NiFi when the flow must be provable) and the rest of the book has a foundation worth building on.

In the real world

On a real mAb line, the gateway rarely talks to a simulator. It talks to a DCS (Distributed Control System — the plant-wide controller running the process) like Emerson DeltaV or a Siemens controller over OPC UA, to standalone skids over Modbus or S7 (a Siemens controller protocol), and to a commercial historian like AVEVA PI (a vendor time-series database) on the IT side. Those systems do not run on a laptop and are license-locked, which is why our companion repo mocks the OPC UA bioreactor and ships the heavy gateway behind an opt-in profile — the integration code is real, the vendor-specific quirks are not exercised here.

The honest OSS-vs-commercial verdict for this layer: the open-source gateways genuinely do the bridging job well. Node-RED, Telegraf, and NiFi are all production-grade, and NiFi's provenance is a real differentiator. What they do not give you out of the box is the validated-system wrapper: vendor accountability, a turnkey Part 11 audit trail, qualified high availability, and the IQ/OQ/PQ paperwork. IQ/OQ/PQ (Installation, Operational, and Performance Qualification) is the documented evidence that the system was installed, runs, and performs as required; it is part of the facility-qualification and process-validation lifecycle that Biologic Drug Manufacturing lays out in Tech Transfer and Scale-Up and Quality, Regulatory, and Data. Commercial edge platforms (Ignition, HighByte, the historian vendors' own connectors) sell that wrapper. You can reach roughly the same data outcome with OSS for far less license cost, but the GxP last mile is yours to validate, and that work is real.

Key terms

Edge gateway — the device/software at the OT/IT seam that reads field protocols, transforms data, buffers it, and routes it onward without altering the control loop.
OT / IT — Operational Technology (controllers, skids, sensors on isolated control networks) vs. Information Technology (databases, dashboards, analytics).
Southbound / northbound — gateway-speak for the field-protocol side (down toward devices) vs. the data-platform side (up toward IT).
Node-RED — browser-based low-code flow editor; flows stored as JSON, runs on Node.js; great for prototyping, weak RBAC.
Telegraf — single-binary, plugin-driven, TOML-configured collection agent; deterministic, no per-message audit trail.
Apache NiFi / MiNiFi — JVM dataflow tool with replayable, FlowFile-level provenance (chain of custody); MiNiFi is its small-footprint edge agent.
Provenance / data lineage — the recorded who/what/when/from-what history of a data record; W3C PROV-O is the standard vocabulary (entities, activities, agents).
QoS (MQTT) — delivery guarantee: QoS 0 at-most-once (can lose), QoS 1 at-least-once (can duplicate), QoS 2 exactly-once.
Quality flag — the legacy OPC DA (Classic) status code (e.g. 192 = Good; 64 Uncertain, 0 Bad) carried with each value so consumers can tell good readings from uncertain/bad. (OPC UA-native quality is a StatusCode whose Good is 0; see Chapter 7.)
sensor_reading row (long format) — the gateway's output contract: ts, tag, value, unit, quality, batch_id. One measurement per row, so a new tag is more rows, never a schema change; identity is the (ts, tag) pair.
Provenance event (PROV-O) — a FlowFile lineage record of entity, activity, (software) agent, time. It proves how data flowed; it lacks the human who/why and the before/after of a changed value, so it is not a Part 11 audit trail.
ALCOA+ "Complete" — the data-integrity attribute requiring that no data be silently lost; the gateway's delivery guarantees protect it.
Grouped split / leakage — validating a model by holding out whole batches (grouping on batch_id, e.g. scikit-learn GroupKFold) so near-duplicate within-batch rows never straddle the train/test line and inflate the score; the gateway's batch_id is the grouping key.
Applicability domain — the input region a model is trusted on; a quality-flagged or out-of-range reading is out-of-domain and should be refused, not scored — which is why the gateway must carry the flag through.
Process drift vs. data drift — a real shift in the process (new lot, scale move) versus a gateway artifact (dropped samples, a transform exception); a completeness-preserving gateway is what lets a drift monitor tell them apart.
SHACL shape — a closed-world constraint (Shapes Constraint Language) that fails on a missing or out-of-range value; the gateway's row contract (quality present and not Bad, ts typed) maps directly onto one.

Where this leads

The gateway now reliably routes a clean, quality-flagged stream into the historian. But routing is only as good as the source. In the next chapter, Upstream Capture: The Production Bioreactor, we point the pipeline at the heart of the process — the fed-batch CHO bioreactor itself — and capture its setpoints, process values, OPC UA quality codes, and ISA-88 phase context as the 14-day batch unfolds, turning a stream of rows into a story about a living culture.
