Bridging to Commercial Historians: AVEVA/OSIsoft PI

📍 Where we are: Part IV · Meeting Reality — we leave the all-open-source comfort zone and wire our stack to the one system most plants will never let us replace: the validated commercial historian.

The simple version

Think of the plant's PI System as the official, notarized ledger in the bank vault. We are not allowed to throw it out, and frankly we should not want to — auditors trust it, it has been validated, and the whole site already reads from it. What we are allowed to do is build a fast, friendly photocopier next to the vault: it makes faithful copies of the ledger for our analytics, and when we compute something new, it can hand a copy back through the proper window. This chapter builds that photocopier — the clean boundary between our open-source stack and the commercial historian — and is honest about which side holds the original.

What this chapter covers

For sixteen chapters we built a complete open-source data platform: an OPC UA bioreactor, a Sparkplug bus, a TimescaleDB historian, an ISA-88/95 model in PostgreSQL, contextualization views, and a knowledge graph. It all runs on a laptop. So why does this chapter exist?

Because in a real biopharmaceutical plant, the historian of record is almost never the one you just built. It is a validated AVEVA PI System (the product formerly branded OSIsoft PI), and it is the record-of-truth for years of GMP process data. ("Validated," in this regulatory sense, means the system has been through a formal, documented qualification proving it does what it should and keeps trustworthy records — see Why the record-of-truth stays commercial below.) The honest-hybrid story of this book begins here. This chapter covers:

Why the GMP historian record stays commercial, and why that is the lower-risk choice — not a failure of open source.
The two boundaries you actually engineer: PI Web API (REST) and OPC UA.
A backfill and reconciliation pattern so the OSS copy and the PI original agree.
How to develop and test the bridge against a PI stub when no real PI server is in reach.

cGMP, by the way, is current Good Manufacturing Practice: the FDA's binding expectation that drug manufacturing is controlled, documented, and reproducible. GxP, used throughout, is the umbrella for all "good practice" regulations (Good Manufacturing, Laboratory, Clinical, and Distribution practice), of which cGMP is the manufacturing slice. Keep cGMP in mind; it is the reason the vault stays locked.

Why the record-of-truth stays commercial

A PI System is not just a database. It is a PI Data Archive of compressed time-series tags, plus PI Asset Framework (AF) — an asset-centric layer that contextualizes raw tags into hierarchies, attributes, units, and templates [1]. A site has typically spent years validating it: installation qualification, operational qualification, SOPs for change control, a reviewed audit trail. Under quality risk management, ripping that out to chase an open-source historian would be a high-effort, high-risk move for little patient benefit. ICH Q9(R1) is explicit that the effort, formality, and documentation of risk management must be commensurate with the level of risk [2] — and the risk of replacing a validated record system is large.

The FDA's Computer Software Assurance (CSA) guidance points the same way. It encourages a risk-based, least-burdensome approach: spend your assurance effort where the patient risk is, and lean on existing controls elsewhere [3]. That gives us a clean division of labor. The validated PI System carries the high-rigor GxP record. Our open-source layer — TimescaleDB, Grafana, the soft-sensor (a measurement computed from other signals rather than read by a physical probe) — is the lower-risk analytics and edge complement (here "edge" means the equipment-side layer close to the instruments). It computes, visualizes, and explores; it does not have to be the original record.

This is exactly the architecture the digital-twin literature describes (a digital twin is a live data-and-model replica of a physical asset, fed from the plant): the data historian is the integration hub of record, and process data is replicated from it out to cloud and analytics layers over OPC and TCP/IP transports [4]. We are not inventing a pattern. We are implementing the established one, in open source, with our eyes open about who owns the original.

The data-integrity vocabulary makes the boundary precise. MHRA's GxP guidance distinguishes the original record from a true copy [5]. The validated PI System retains the original GxP record, with its full audit trail and dynamic data. Our OSS layer holds true copies and derived data. Say that out loud in a design review and a quality unit will relax: you are not asking them to trust open source with the original.

The OSS↔PI boundary as engineering, not ideology: PI keeps the original GxP record; the open-source layer holds true copies and writes derived values back through a controlled door. Original diagram by the authors, created with AI assistance.

The two boundaries: PI Web API and OPC UA

Two doors into a modern PI System matter for this bridge, and you will use both.

Door one — PI Web API. This is a RESTful interface giving client applications read and write access to PI Data Archive and AF data over HTTPS [6]. It speaks JSON, it is firewall-friendly, and it is the natural fit for a Python client or an Apache NiFi flow. You request a stream's recorded values over a time window; you POST new values to a point.

Door two — OPC UA. PI ingests from OPC UA servers through the PI Connector/Adapter for OPC UA, which makes OPC UA a first-class read/write boundary between PI and the edge [7]. Because OPC UA is platform-independent, secure, and firewall-friendly [8], and because our bioreactor already speaks it, this door costs us almost nothing — our existing opcua-server and Telegraf's OPC UA input plugin [9] plug straight in. The subtle, important part is history: OPC UA Part 11, Historical Access, defines how a client reads and writes historical data and events, not just live values [10]. That is what makes gap-filling against PI possible, and it is the backbone of the reconciliation pattern below.

A rule of thumb from the field: use OPC UA when PI should pull live data from us (PI stays the collector, the validated path is unchanged), and use PI Web API when we pull from PI for analytics or write a computed value back. The first keeps PI's ingestion validated and untouched; the second keeps our analytics loop fast.

Talking to PI Web API: the request shape

Since AVEVA PI cannot run on a laptop and ships no public image, the honest-hybrid design is to develop against a mock rather than the real thing — the same approach the book takes for SAP (an enterprise resource-planning system) and DeltaV (a distributed control system, or DCS). That mock is now shipped: a small FastAPI service at examples/services/pi-web-api-stub/ that honors the same request and response contract a real PI Web API does, serves our golden batch, and accepts writes. The companion repo's compose.yaml runs it under a commercial profile (alongside the core, capture, semantics, and analytics/ops profiles); the DeltaV dcs-mock from the next chapter now ships under that same commercial profile, leaving only the sap-mock on the roadmap. So treat the PI snippets below as the request/response contract — illustrative shapes, clearly labelled — while the stub that serves them and the bridge that parses them are real, tested code (examples/chapters/17-bridge-pi-historian/pi_bridge.py, exercised by examples/tests/test_bridges.py). The loader on our side of the boundary, shown later in this chapter, is real too.

PI Web API addresses everything by an opaque WebId. You resolve a tag (a "PI Point") to its WebId once, then read its recorded values over a window. A recorded-values response looks like this (illustrative PI Web API JSON, matching AVEVA's documented schema; it is the shape the shipped stub serves and the bridge parses):

{
  "Items": [
    { "Timestamp": "2026-01-18T23:58:00Z", "Value": 5.8214, "UnitsAbbreviation": "g/L",
      "Good": true, "Questionable": false, "Substituted": false },
    { "Timestamp": "2026-01-18T23:59:00Z", "Value": 5.7589, "UnitsAbbreviation": "g/L",
      "Good": true, "Questionable": false, "Substituted": false }
  ]
}

Two things to notice. First, those titer values — 5.8214 and 5.7589 g/L (titer = the concentration of the secreted antibody product, in grams per litre) — are the exact last two readings of BR101.Titer.PV for our golden batch BATCH-2026-001, taken from the 1-minute trace in datasets/fedbatch_timeseries.parquet (the very file the loader below reads), which ends at 23:59. (The apparent dip from 5.8214 to 5.7589 g/L is sensor noise on a roughly 0.03 g/L in-line measurement band, not a real drop — accumulated titer only rises in a fed-batch, and ~5.8 g/L is a solid late-harvest endpoint for a 14-day fed-batch CHO (Chinese Hamster Ovary) mAb (monoclonal antibody) run; reconciliation must reproduce the noisy reading exactly, dip and all.) PI and our stack are looking at the same physical bioreactor; they had better agree to the digit. Second, PI Web API surfaces quality as Good/Questionable/Substituted flags rather than a single numeric OPC quality code. Our historian stores OPC-style quality (192 for Good, 64 for Uncertain). Bridging the two is a small mapping table you write once and validate — never an afterthought.

Writing back is the same door in reverse: a POST to a point's recorded-values endpoint with a JSON body of {Timestamp, Value} items. This is how a computed tag — say BR101.Titer.SoftSensor from the Chapter 29 Raman model — re-enters the PI world so operators see it next to the instrument tags they trust.
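A minimal sketch of building that write-back body, assuming the endpoint accepts a JSON array of {Timestamp, Value} items as described above. The helper name recorded_body is hypothetical (it is not taken from pi_bridge.py), and the soft-sensor value shown is an invented placeholder, not a model output.

```python
import json
from datetime import datetime, timezone


def recorded_body(readings):
    """Build a list of {Timestamp, Value} items from (datetime, value) pairs.

    Hypothetical helper: the name and signature are assumptions for this sketch.
    """
    items = []
    for ts, value in readings:
        if ts.tzinfo is None:
            # A naive datetime has no defined UTC instant, so refuse it.
            raise ValueError("timestamps must be timezone-aware")
        stamp = ts.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
        items.append({"Timestamp": stamp, "Value": float(value)})
    return items


# Placeholder value for illustration only.
body = recorded_body([(datetime(2026, 1, 18, 23, 59, tzinfo=timezone.utc), 5.76)])
payload = json.dumps(body)
print(payload)
```

The payload would then travel as the body of the POST. The HTTP call itself is left out so the sketch runs without a server.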

Anatomy of a PI Web API recorded-values Item

The whole bridge turns on one small JSON object, repeated once per reading inside that Items array. It is worth dissecting in full, because every field either survives the crossing into our historian or is deliberately collapsed — and a bridge author who cannot say which is which will silently lose quality. The object the shipped stub emits is built field-for-field in services/pi-web-api-stub/app.py (the recorded() handler), and the object the bridge consumes is read field-for-field in pi_bridge.to_sensor_rows(). Here is the second Item from the response above — the golden batch's last reading — laid against the row it becomes:

The addressing chain comes first, before any value. PI Web API never lets you ask for BR101.Titer.PV by name on the hot path. You resolve the PI Point name — its fully-qualified form is \\PISRV\BR101.Titer.PV — to an opaque WebId once, then read the stream by that handle. The shipped webid() is not a black box: it is base64.urlsafe_b64encode(point.encode()).decode().rstrip("="), so BR101.Titer.PV resolves to exactly QlIxMDEuVGl0ZXIuUFY — base64url of the tag bytes, trailing = padding stripped. The stub's _point() reverses it (re-padding to a multiple of four before decoding). A real PI WebId is a longer, server-minted token, but the contract is identical: name in, opaque handle out, read by handle.
Timestamp — an ISO-8601 UTC instant. 2026-01-18T23:59:00Z. The trailing Z is load-bearing: it pins the reading to UTC so no daylight-saving or site-timezone ambiguity can creep in when it lands in a timestamptz column. The stub formats it with strftime("%Y-%m-%dT%H:%M:%SZ"), which is why our minute marks have no sub-second part.
Value — the measurement. 5.7589, the real last titer of BATCH-2026-001 (lower than the prior 5.8214 only because of in-line measurement noise — the true accumulated titer never falls). The stub rounds to four places (round(float(v), 4)), which is exactly why the contract test asserts the served values equal the parquet tail rounded to four places — agreement is defined to the digit the wire actually carries, not to a precision the wire never had.
UnitsAbbreviation — the unit, inline. g/L. PI carries the unit on every Item, so a value is never a bare number on this wire any more than it is in our historian.
Good / Questionable / Substituted — quality, as three booleans. This is the field that does not map one-to-one. PI surfaces quality as three independent flags; our historian stores a single OPC-style smallint. The next section dissects the collapse.

One PI Web API Item, fully unpacked: the addressing chain resolves a name to a WebId, then the bridge maps Timestamp, Value, unit, and three quality booleans into one row of ts.sensor_reading. Original diagram by the authors, created with AI assistance.

The reason to draw it as a card is the same reason the OPC UA chapter draws a node as a card: the discipline of the bridge is to account for every field, name where it lands, and be honest about the one field (Substituted) our six columns have no slot for.
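The field-by-field mapping can be sketched in a few lines. This is an illustration under stated assumptions, not the shipped pi_bridge.to_sensor_rows(): the function name item_to_row and its signature are hypothetical, and the quality code is passed in already collapsed so the sketch stays focused on Timestamp, Value, and unit.

```python
from datetime import datetime, timezone


def item_to_row(item, tag, batch_id, quality):
    """Map one PI Web API Item to a (ts, tag, value, unit, quality, batch_id) row.

    Hypothetical helper for illustration; the caller supplies the OPC-style
    quality code, and the Substituted flag is not carried into the row.
    """
    # The trailing Z means UTC, so attach the UTC timezone explicitly.
    ts = datetime.strptime(item["Timestamp"], "%Y-%m-%dT%H:%M:%SZ").replace(
        tzinfo=timezone.utc
    )
    return (ts, tag, float(item["Value"]), item["UnitsAbbreviation"], quality, batch_id)


item = {
    "Timestamp": "2026-01-18T23:59:00Z", "Value": 5.7589, "UnitsAbbreviation": "g/L",
    "Good": True, "Questionable": False, "Substituted": False,
}
row = item_to_row(item, "BR101.Titer.PV", "BATCH-2026-001", quality=192)
print(row)
```

Keeping the tag and batch_id as arguments mirrors the point made above: the Item itself carries neither, so the caller has to bring them.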

The same row, as a triple the digital thread can walk

That six-column row is also a row in the knowledge graph the previous chapter built. The graph's data model is RDF — every fact is a triple of subject, predicate, object — and the reading lands as a small bundle of them, the same shape Semantics & the Digital Thread emits for a release result. Written in Turtle (the human-readable RDF syntax), the 5.7589 g/L Item becomes:

@prefix bp:   <https://example.org/bioproc#> .
@prefix qudt: <http://qudt.org/schema/qudt/> .
@prefix unit: <http://qudt.org/vocab/unit/> .
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .

bp:reading-BR101-Titer-2026-01-18T2359
    a              bp:SensorReading ;
    bp:ofTag       "BR101.Titer.PV" ;
    bp:value       "5.7589"^^xsd:float ;
    qudt:unit      unit:GM-PER-L ;        # g/L, machine-readable, not a bare string
    bp:quality     192 ;                  # OPC code, after the crosswalk
    bp:fromBatch   bp:BATCH-2026-001 ;
    prov:wasDerivedFrom bp:pi-point-BR101-Titer-PV .   # PI holds the original

Three things the bridge already does map cleanly onto formal semantics. First, bp:fromBatch is the same edge the genealogy uses, so this reading hangs off the bp:derivedFrom-rooted batch node and a single (bp:derivedFrom)+ SPARQL property path walks from it back to the working cell bank — the derivedFrom transitive spine Book 4 builds. Second, the qudt:unit is why the bridge never lets a value travel as a bare number: the unit is a typed fact, exactly the identifiers-and-units discipline. Third — and this is the field the relational row cannot carry — PROV-O's prov:wasDerivedFrom records that the OSS copy was derived from the PI original, turning the MHRA "true copy vs original record" distinction from prose into a machine-checkable provenance edge.

The reconciliation contract has an even tighter ontology reading: the rule "every reading must carry exactly one value, one unit, and one quality code, and quality must be drawn from {192, 64, 0}" is a SHACL sh:NodeShape — sh:path bp:quality ; sh:in ( 192 64 0 ) ; sh:minCount 1 ; sh:maxCount 1 — the same release-gate-and-SHACL shape language that gates a batch's CQAs. And "does the OSS copy reproduce the PI original to the digit?" is a textbook competency question (a question the model must be able to answer, paired with its expected result): for tag T over window W, do the OSS and PI values agree within tolerance? — the zero-row reconciliation query below is precisely that CQ, run as a PASS/FAIL test, the discipline Book 4 catalogues in competency-questions-as-queries. The bridge does not need a triplestore to be correct, but the meaning it enforces — typed units, lineage, provenance, a shape — is exactly the formal vocabulary, which is why the same row drops into the graph without translation.

Where this row comes from — the trilogy spine

That 5.7589 g/L titer began as a physical event: a sample drawn from the production bioreactor in Book 1, where a CHO culture actually secretes the antibody this number measures. Book 2 frames the same reading as a data-point that has to travel — the connectivity standards (OPC UA, PI Web API) that carry it off the floor, and the open challenge of automation and control data landing faithfully in a historian. This chapter is the code that closes that loop: the bridge and the ts.sensor_reading row are where the physical measurement and the data-point become a tested, reconciled artifact.

Anatomy of a PI Asset Framework element: where a WebId comes from

A WebId looks arbitrary, but in a real plant it is the address of something richly contextualized. PI is not just the PI Data Archive of raw tags; it is also PI Asset Framework (AF), an asset-centric layer that wraps those tags in hierarchy, attributes, units, and templates [1]. The chain behind a single WebId is three links deep:

Element — an AF object that models a piece of the plant. For us that is the bioreactor BR101, an instance of an AF element template so that every bioreactor on the site exposes the same attributes by the same names. This is AF's job: contextualize raw tags into a hierarchy a human can navigate, exactly the role our own ISA-88/95 model plays on the open-source side.
Attribute — a named, typed property hanging off the element: Titer, with a unit of measure (g/L) and metadata. The attribute is what an analytics client browses to; it is the AF-side analogue of an OPC UA Variable's BrowseName.
PI Point — the raw archived tag the attribute points at: BR101.Titer.PV in the Data Archive. The attribute is the contextualized face; the PI Point is the compressed time-series underneath. The WebId you resolve can address any of the three — element, attribute, or the underlying point's stream — which is why a real PI exposes endpoints for points, streams, and AF; the shipped stub serves the two that carry our golden batch (the points resolve and the streams/.../recorded read and write), and the AF layer is described here for fidelity rather than served.

The payoff: when you write a bridge against the attribute rather than the bare PI Point, a unit change or a tag rename on the PI side does not silently break your read, because AF holds the contextualization the digital-twin literature expects of the historian-of-record [4]. The shipped stub keeps this lighter: it resolves a PI Point name directly and serves the PIPoint-resolve and Stream-recorded endpoints (AF is described, not served). Those are the same shapes a real PI exposes, though, so the contract you test against is the contract you meet.

Resolving a tag: from PI Point name to WebId

Before any value moves, exactly one round-trip resolves identity. The bridge calls the stub's GET /piwebapi/points?path=\\PISRV\BR101.Titer.PV; the handler takes the last \-separated segment as the tag, computes _webid(tag), and returns {WebId, Name, Path}. From then on every read is GET /piwebapi/streams/{webId}/recorded. Three properties of this make it safe to develop against the mock:

The WebId is deterministic for the mock. Because webid() is a pure function of the tag string, the bridge and the stub agree on the handle without a registration step — to_sensor_rows() never has to ask the stub "what tag was this?", because the caller already knows the tag it resolved.
The reverse map exists. The stub's _point() decodes the WebId back to the tag so its recorded() handler can filter the golden trace — df[df.tag == tag]. Against a real PI the WebId is opaque and you would not decode it; you would carry the tag name alongside, exactly as to_sensor_rows(items, tag, batch_id) already does. The bridge does not depend on the WebId being decodable.
A missing tag is a 404, not a silent empty. If the resolved tag has no rows, the stub raises HTTPException(404, f"no PI Point for {tag}"). A bridge that swallowed that into an empty backfill would quietly leave a gap; surfacing it is the contract.
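The resolve step can be sketched with the standard library alone. The webid() expression is the one quoted earlier in this chapter; point_from_webid() and tag_from_path() are hypothetical stand-ins for what the stub's _point() and its path handling are described as doing, not copies of the shipped code.

```python
import base64


def webid(point: str) -> str:
    """Mock WebId: base64url of the tag bytes, trailing '=' padding stripped."""
    return base64.urlsafe_b64encode(point.encode()).decode().rstrip("=")


def point_from_webid(web_id: str) -> str:
    """Reverse the mock WebId (hypothetical name): re-pad to a multiple of four."""
    padded = web_id + "=" * (-len(web_id) % 4)
    return base64.urlsafe_b64decode(padded).decode()


def tag_from_path(path: str) -> str:
    """Take the last backslash-separated segment of a PI Point path."""
    return path.rsplit("\\", 1)[-1]


tag = tag_from_path("\\\\PISRV\\BR101.Titer.PV")
handle = webid(tag)
print(tag, handle, point_from_webid(handle))
```

Only the mock's handle is reversible like this. Against a real PI the bridge would treat the WebId as opaque and carry the tag name alongside it.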

Where the OSS copy lands: load it like any other source

On our side of the boundary, a value pulled from PI is just another time-series row. It lands in the same ts.sensor_reading hypertable that every capture chapter writes to, with the same six columns. The companion repo's loader shows the exact shape and the exact COPY path — this is from examples/tools/load_datasets.py:

def load_timeseries(conn) -> int:
    df = pd.read_parquet(DATA / "fedbatch_timeseries.parquet")
    buf = io.StringIO()
    df[["ts", "tag", "value", "unit", "quality", "batch_id"]].to_csv(buf, index=False, header=False)
    buf.seek(0)
    with conn.cursor() as cur:
        cur.execute("TRUNCATE ts.sensor_reading")
        with cur.copy("COPY ts.sensor_reading (ts, tag, value, unit, quality, batch_id) "
                      "FROM STDIN WITH (FORMAT csv)") as copy:
            copy.write(buf.read())
    return len(df)

The columns are the contract: ts, tag, value, unit, quality, batch_id. A PI bridge does nothing more exotic than this — it turns PI Web API JSON into rows of exactly this shape and COPYs them in. The header of that same file states the design plainly:

This is the path Chapters 7–16 build up piece by piece; here it is one script so
the contextualization and ALCOA+ chapters have data to query. Idempotent: it
truncates the loaded tables first.

The word idempotent is the whole game for a bridge. Re-running the loader leaves the table in the same state, because it truncates first. A production PI bridge cannot truncate the world every run, so it earns idempotency a different way. Note one constraint the shipped schema imposes: platform/db/20-historian.sql builds ts.sensor_reading as a TimescaleDB hypertable with only non-unique indexes on (tag, ts DESC) and (batch_id, ts DESC) — there is no primary key or unique constraint on (tag, ts). So a naive INSERT ... ON CONFLICT (tag, ts) would fail against the repo as built. The mechanism the current schema does support is delete-window-then-insert: inside one transaction, DELETE the rows for the tags and time window you are about to backfill, then INSERT the freshly read values. Re-reading an overlapping window then never duplicates a row. (If you want a true upsert, you would first add a UNIQUE (tag, ts) index — and on a hypertable that index must include the partitioning column ts.) Same goal, different mechanism. The CSV the loader consumes is the same long format you saw the simulator emit (these are the last two 1-minute rows of the golden batch, the tail of fedbatch_timeseries.parquet):

ts,tag,value,unit,quality,batch_id
2026-01-18 23:58:00+00:00,BR101.Titer.PV,5.8214,g/L,192,BATCH-2026-001
2026-01-18 23:59:00+00:00,BR101.Titer.PV,5.7589,g/L,192,BATCH-2026-001

Notice the round-trip closes: the PI Web API JSON above and this CSV row describe the same measurement. The bridge's only job is to keep that true.

Backfill and reconciliation: making the two copies agree

Live streaming is the easy 90%. The hard, GxP-relevant 10% is what happens after the network was down for two hours. The OSS copy now has a gap the PI original does not, because PI's validated collector buffered through the outage. This is precisely why OPC UA Historical Access matters [10]: the bridge can ask PI for the historical values across the missing window and fill the hole.

The four-move backfill loop

A robust backfill loop has four moves:

Find the gap. Query our historian for the latest ts per tag; anything newer in PI is missing here.
Read history. Pull PI Web API recorded values (or OPC UA HA reads) for [last_ours, now].
Replace the window, do not blindly append. In one transaction, DELETE the rows for the affected tags and [last_ours, now] window, then INSERT the freshly read values — so an overlapping window is safe to re-pull without duplicating rows. (Switch to INSERT ... ON CONFLICT only after you add a UNIQUE (tag, ts) index, which the shipped hypertable does not yet carry.)
Reconcile. Re-compare a sample of overlapping points and assert they match within tolerance; log any divergence as a data-integrity event, not a silent overwrite.
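Moves 1 and 3 can be sketched as follows. To keep the example runnable without a database server it uses an in-memory SQLite table named sensor_reading as a stand-in for the ts.sensor_reading hypertable, which is an assumption of this sketch; the function names are hypothetical, and a PostgreSQL version would use the same DELETE-then-INSERT inside one transaction.

```python
import sqlite3

COLUMNS = "ts, tag, value, unit, quality, batch_id"


def latest_ts(conn, tag):
    """Move 1: the newest timestamp we hold for a tag (None if we hold nothing)."""
    return conn.execute(
        "SELECT max(ts) FROM sensor_reading WHERE tag = ?", (tag,)
    ).fetchone()[0]


def replace_window(conn, tag, start, end, rows):
    """Move 3: delete the window for one tag, then insert the fresh rows."""
    with conn:  # one transaction: commits on success, rolls back on error
        conn.execute(
            "DELETE FROM sensor_reading WHERE tag = ? AND ts BETWEEN ? AND ?",
            (tag, start, end),
        )
        conn.executemany(
            f"INSERT INTO sensor_reading ({COLUMNS}) VALUES (?, ?, ?, ?, ?, ?)",
            rows,
        )


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sensor_reading "
    "(ts TEXT, tag TEXT, value REAL, unit TEXT, quality INTEGER, batch_id TEXT)"
)
rows = [
    ("2026-01-18T23:58:00Z", "BR101.Titer.PV", 5.8214, "g/L", 192, "BATCH-2026-001"),
    ("2026-01-18T23:59:00Z", "BR101.Titer.PV", 5.7589, "g/L", 192, "BATCH-2026-001"),
]
# Pull the same overlapping window twice, as a retry after an outage would.
for _ in range(2):
    replace_window(conn, "BR101.Titer.PV", rows[0][0], rows[-1][0], rows)

row_count = conn.execute("SELECT count(*) FROM sensor_reading").fetchone()[0]
print(row_count, latest_ts(conn, "BR101.Titer.PV"))
```

Running the window twice leaves two rows, not four, which is the idempotency property the loop depends on.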

Mapping quality: PI's three booleans to one OPC code

Step 2 hides the bridge's one genuinely lossy seam, and it deserves a figure of its own because it is the only place where the round-trip is not loss-free. PI describes quality with three independent booleans on every Item — Good, Questionable, Substituted. Our ts.sensor_reading.quality is a single smallint defaulting to 192, carrying the OPC-style codes the rest of the platform already speaks (192 Good, 64 Uncertain, 0 Bad). The shipped quality_code() is the whole crosswalk, and its order matters: it checks Questionable first (so an uncertain point can never be rounded up to Good), then returns 192 for Good and 0 otherwise.

Two honest consequences fall out of that crosswalk. First, Substituted has nowhere to land: our six columns carry no slot for "this value was filled in by the collector", so the bridge drops it. That is a documented loss, not an accident, and a quality reviewer should know it is the one PI fact the OSS copy does not retain. Second, because the contract test asserts rows[0][4] == 192 for the golden batch's Good readings, the crosswalk is not a comment in a design doc; it is an executable assertion that ships with the bridge.
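The crosswalk is small enough to enumerate in full. The sketch below follows the order described for quality_code() but is not a copy of it: the name collapse_quality and its signature are hypothetical, and mapping Questionable to 64 is an assumption drawn from the OPC codes listed above.

```python
from itertools import product


def collapse_quality(good, questionable, substituted=False):
    """Collapse PI's three flags into one OPC-style code.

    Assumption: Questionable maps to 64 (Uncertain). Substituted is accepted
    but ignored, because the six-column row has no slot for it.
    """
    if questionable:  # checked first, so an uncertain point never becomes Good
        return 64
    return 192 if good else 0


# Build the full truth table over all eight flag combinations.
table = {
    flags: collapse_quality(*flags)
    for flags in product([True, False], repeat=3)
}
for (good, questionable, substituted), code in table.items():
    print(f"Good={good!s:5} Questionable={questionable!s:5} "
          f"Substituted={substituted!s:5} -> {code}")
```

Eight input combinations collapse to three codes, and every pair of rows that differs only in Substituted lands on the same code. That is the documented loss made visible.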

Apache NiFi is the natural home for this when you want a visual, replayable flow with provenance: its InvokeHTTP processor is an HTTP client that calls a configurable endpoint and sends a FlowFile body as the request, supporting GET for reads and PUT/POST/PATCH for writes [11] — exactly the verbs PI Web API uses. NiFi's provenance then records who pulled what, when, and from where, which is gold during an investigation. The reconciliation step itself is plain SQL against the two copies (illustrative — stage.pi_recorded is the staging table a real bridge would land its PI read into; the shipped test makes the equivalent value-agreement assertion in Python, tests/test_bridges.py):

-- Reconcile the OSS copy against PI for one tag/window.
-- Flags any point where the copies disagree by more than tolerance.
SELECT o.ts, o.value AS oss_value, p.value AS pi_value,
       abs(o.value - p.value) AS delta
FROM   ts.sensor_reading o
JOIN   stage.pi_recorded  p ON p.tag = o.tag AND p.ts = o.ts
WHERE  o.tag = 'BR101.Titer.PV'
  AND  o.batch_id = 'BATCH-2026-001'
  AND  abs(o.value - p.value) > 1e-6;   -- expect zero rows

If that query returns rows, your copy drifted — perhaps a unit-conversion bug, perhaps a clock skew. Returning zero rows is the property this query exists to prove; the shipped test (tests/test_bridges.py) makes the equivalent assertion in Python — that the served PI values equal the parquet tail to the digit, and that Good maps to 192 — and it is the engineering proof that the true copy is faithful to the original.
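The same comparison can be sketched in plain Python for cases where both copies are already in memory. This is an illustration, not the shipped test: reconcile is a hypothetical name, the two inputs are assumed to be timestamp-to-value maps for a single tag, and the drifted value is invented to show a failure.

```python
def reconcile(oss, pi, tolerance=1e-6):
    """Compare two {timestamp: value} maps for one tag.

    Returns (diverged, missing_in_oss): shared points whose values differ by
    more than tolerance, and PI timestamps the OSS copy does not hold at all.
    """
    diverged = [
        (ts, oss[ts], pi[ts], abs(oss[ts] - pi[ts]))
        for ts in sorted(oss.keys() & pi.keys())
        if abs(oss[ts] - pi[ts]) > tolerance
    ]
    missing_in_oss = sorted(pi.keys() - oss.keys())
    return diverged, missing_in_oss


pi_copy = {"2026-01-18T23:58:00Z": 5.8214, "2026-01-18T23:59:00Z": 5.7589}
clean = reconcile(dict(pi_copy), pi_copy)

drifted = dict(pi_copy)
drifted["2026-01-18T23:59:00Z"] = 5.7590  # invented drift, for illustration
dirty = reconcile(drifted, pi_copy)
print(clean, dirty)
```

An inner join only compares timestamps both sides hold, so the sketch also reports PI timestamps absent from the OSS copy. A gap is a divergence too.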

Testing the bridge with no PI in reach

You will write almost all of this code with no PI server anywhere near you, and that is fine — if you test against a contract rather than a vibe. The honest-hybrid design, now shipped in the repo, is the pi-web-api-stub that serves the PIPoint-resolve and Stream-recorded endpoints a real PI exposes (the AF endpoint is described, not served), serves the golden data, and accepts writes. Because it is FastAPI, it auto-exposes /openapi.json when it runs; pin a contract test to that spec — a fuzzing tool such as schemathesis can then exercise both the stub and the bridge against the same schema a real PI honors. The point the book makes plainly: the goal is a bridge that is real and tested against a contract, with the PI counterpart mocked. When you reach a site with a real PI, you change a base URL and a credential, run the same contract test against the real server, and you are done. The artifacts in this chapter that run today are the bridge (examples/chapters/17-bridge-pi-historian/pi_bridge.py) and its test against the stub (examples/tests/test_bridges.py), plus the loader above (examples/tools/load_datasets.py).

A minimal contract for the read path — illustrative, the shape FastAPI generates for the shipped stub's read endpoint:

# pi-web-api-stub: read path contract (illustrative)
paths:
  /piwebapi/streams/{webId}/recorded:
    get:
      parameters:
        # OpenAPI requires every path template variable to be declared and required
        - { name: webId,     in: path,  required: true, schema: { type: string } }
        - { name: startTime, in: query, schema: { type: string } }   # e.g. "*-2h"
        - { name: endTime,   in: query, schema: { type: string } }   # e.g. "*"
      responses:
        "200": { description: Recorded values, content: { application/json: {} } }

This is the honest core of the whole chapter: we cannot run AVEVA PI on a laptop, so we do not pretend to. We pin the contract, test ruthlessly against it, and document the one-line swap to reality.
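A contract check does not need a heavy toolchain to start. The sketch below validates a recorded-values payload against the six Item fields shown earlier in this chapter; the function name contract_violations is hypothetical, and the field-to-type table is an assumption read off the illustrative JSON rather than taken from the generated OpenAPI spec.

```python
# Field names from the illustrative response; the types are assumptions.
ITEM_FIELDS = {
    "Timestamp": str,
    "Value": (int, float),
    "UnitsAbbreviation": str,
    "Good": bool,
    "Questionable": bool,
    "Substituted": bool,
}


def contract_violations(payload):
    """Return a list of human-readable problems; an empty list means PASS."""
    items = payload.get("Items") if isinstance(payload, dict) else None
    if not isinstance(items, list):
        return ["Items must be a list"]
    problems = []
    for i, item in enumerate(items):
        for field, kind in ITEM_FIELDS.items():
            if field not in item:
                problems.append(f"Items[{i}] is missing {field}")
                continue
            value = item[field]
            # bool is a subclass of int, so reject it explicitly elsewhere.
            wrong_bool = isinstance(value, bool) and kind is not bool
            if wrong_bool or not isinstance(value, kind):
                problems.append(f"Items[{i}].{field} has the wrong type")
    return problems


good_payload = {"Items": [{
    "Timestamp": "2026-01-18T23:59:00Z", "Value": 5.7589, "UnitsAbbreviation": "g/L",
    "Good": True, "Questionable": False, "Substituted": False,
}]}
print(contract_violations(good_payload))
print(contract_violations({"Items": [{"Timestamp": "2026-01-18T23:59:00Z"}]}))
```

The same function could be pointed at the stub today and at a real server later, which is the sense in which the swap is a base URL and a credential.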

The soft sensor that writes back: what the bridge owes a model

The write-back path — POSTing BR101.Titer.SoftSensor from the Chapter 29 Raman model back into PI — looks like a one-way pipe, but it is also the seam where a machine-learning model meets the validated record, and a few disciplines ride on it that the bridge must respect or quietly break.

The first is honest validation. A soft sensor (a value computed from other signals, here titer inferred from a Raman spectrum) is only trustworthy if it was validated on batches it had never seen — and the unit of "never seen" is the batch, not the row. Sibling batches off the same working cell bank share a media lot and a seed train, so they are near-twins; split readings row-wise at random and a near-twin lands on both sides of the train/test line, the model effectively sees the answer, and its reported score is fantasy. The fix is a grouped, leave-one-batch-out split — every reading from a batch goes wholly to train or wholly to test — and the batch_id column the bridge faithfully carries on every row is that grouping key. The genealogy that traces a deviation and the genealogy that defines an honest validation fold are the same thread; Book 5 turns it into GroupKFold and nested cross-validation in models-and-validation, and makes the batch-grouped split the default in data-the-fuel.
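A batch-grouped split can be sketched without any library. This is a plain-Python stand-in for the idea behind scikit-learn's grouped splitters, not the Book 5 implementation; the function name is hypothetical, and BATCH-2026-002 and BATCH-2026-003 are invented sibling batches used only to make the folds visible.

```python
def leave_one_batch_out(batch_ids):
    """Yield (held_out_batch, train_idx, test_idx), one fold per batch.

    Every row of a batch lands wholly in train or wholly in test.
    """
    for held_out in sorted(set(batch_ids)):
        train = [i for i, b in enumerate(batch_ids) if b != held_out]
        test = [i for i, b in enumerate(batch_ids) if b == held_out]
        yield held_out, train, test


# One batch_id per reading, as carried on every ts.sensor_reading row.
batch_ids = (
    ["BATCH-2026-001"] * 2
    + ["BATCH-2026-002"] * 2   # hypothetical sibling batch
    + ["BATCH-2026-003"] * 2   # hypothetical sibling batch
)
folds = list(leave_one_batch_out(batch_ids))
for held_out, train, test in folds:
    print(held_out, "train rows:", train, "test rows:", test)
```

The split needs nothing but the batch_id column, which is why a bridge that drops it takes honest validation away from every model downstream.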

The second is the applicability domain — a gate that asks whether a new reading even resembles the data the model was trained on, so the soft sensor can decline to guess out of its depth rather than emit a confident wrong number. The perfusion variant in In the real world below is exactly such an out-of-domain case: a model calibrated on the 14-day fed-batch sees a high steady-state cell density and continuous-harvest tags (PBR201.Harvest.Titer) it never trained on. A soft sensor that writes back a titer for a regime outside its trained envelope is writing fiction into the validated record. The learning-problem chapter frames this gate; the bridge enforces it by refusing to POST a value the model has flagged out-of-domain.
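One simple form of that gate is a range envelope over the model's inputs. The sketch below is an assumption-laden illustration: the feature names, the envelope bounds, and the two example readings are all invented placeholders, and a production gate would more likely use a multivariate distance than independent ranges.

```python
def in_domain(features, envelope):
    """True only if every enveloped feature is present and inside its range."""
    return all(
        name in features and low <= features[name] <= high
        for name, (low, high) in envelope.items()
    )


def value_to_post(prediction, features, envelope):
    """Return the prediction if it is in-domain, else None (do not POST)."""
    return prediction if in_domain(features, envelope) else None


# Placeholder envelope for a 14-day fed-batch model; the bounds are invented.
envelope = {"culture_day": (0.0, 14.0), "viable_cell_density": (0.3, 25.0)}

fed_batch_reading = {"culture_day": 13.9, "viable_cell_density": 18.0}
perfusion_reading = {"culture_day": 30.0, "viable_cell_density": 80.0}

print(value_to_post(5.76, fed_batch_reading, envelope))
print(value_to_post(9.90, perfusion_reading, envelope))
```

Returning None rather than a clipped or best-effort number keeps the decision explicit: the write-back path has nothing to send, so nothing reaches the validated record.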

The third is the distinction the reconciliation loop exists to keep straight: process drift versus model drift. The living cells genuinely wander batch to batch — that is process drift, a real manufacturing signal the digital thread must faithfully preserve, never smooth away. A soft sensor going stale against that moving process is model drift, a defect to detect and act on. Conflating the two is how a monitor either cries wolf or misses a real shift, the mlops-and-lifecycle governance Book 5 builds. The bridge serves both: it is the governance data source SPC, multivariate models, and the soft sensor read from, and the place a model's lineage — which dataset hash, which model version produced BR101.Titer.SoftSensor — should be recorded as first-class provenance, so a later audit can walk from a released lot back to the exact frozen model that touched it. The same hybrid-and-physics reasoning, where a mechanistic mass balance constrains the data-driven estimate, is the hybrid-models-and-digital-twins story. A bridge that loses batch_id, drops provenance, or writes back an out-of-domain guess does not just lose data — it poisons the very models the OSS layer exists to feed.

Why it matters

Get this boundary wrong and you face one of two failures. Either you try to make the open-source historian the GMP record — and inherit a validation, audit-trail, and 21 CFR Part 11 (FDA's electronic-records and electronic-signatures rule) burden that no OSS historian carries out of the box, for no patient benefit. Or you let the two copies silently diverge, and your shiny Grafana dashboard quietly disagrees with the record auditors will actually read.

What the inspectors actually find

This is not a hypothetical risk; it is one of the most-cited deficiencies in the published inspection record. A retrospective analysis of FDA Warning Letters issued to pharmaceutical companies from 2010 to 2020 found that documentation and data-integrity problems were a dominant cGMP deficiency category — cited as a major deficiency in roughly 20–25% of letters on average, and accounting for about 21% of cGMP warning letters in the period studied [13]. The recurring failure mode is precisely the one this chapter engineers against: a copy or a derived record that cannot be shown to be a faithful, attributable reflection of the original — the ALCOA+ data-integrity attributes (Attributable, Legible, Contemporaneous, Original, Accurate, plus Complete, Consistent, Enduring, and Available) inspectors apply. That is why the MHRA vocabulary of original record versus true copy exists [5], and why the reconciliation query above — the one that must return zero rows — is not box-ticking but the documented proof that the OSS true copy still equals the PI original.

The field lesson is blunt: the moment your copy can drift from the original without an alarm, you have built exactly the deficiency inspectors write up. The four-move loop, the delete-window-then-insert idempotency, the quality crosswalk, and the zero-row reconciliation assertion are each a small insurance policy against landing in that 21%.
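Two of those insurance policies, the delete-window-then-insert idempotency and the zero-row reconciliation assertion, fit in a few lines. The sketch below uses an in-memory SQLite database purely so it runs anywhere; the table name `oss_copy`, the helper names, and the sample values are assumptions for illustration, and the real historian in this book is TimescaleDB, where the SQL would differ in detail.

```python
import sqlite3
from collections import Counter


def backfill_window(conn, tag, start, end, rows):
    """Delete the window, then insert what the original holds for it."""
    with conn:  # one transaction: both statements commit, or neither does
        conn.execute(
            "DELETE FROM oss_copy WHERE tag = ? AND ts >= ? AND ts < ?",
            (tag, start, end),
        )
        conn.executemany(
            "INSERT INTO oss_copy (tag, ts, value) VALUES (?, ?, ?)",
            [(tag, ts, value) for ts, value in rows],
        )


def reconcile(conn, tag, start, end, original_rows):
    """Return every (ts, value) present on one side only; empty means agreement."""
    copy_rows = conn.execute(
        "SELECT ts, value FROM oss_copy WHERE tag = ? AND ts >= ? AND ts < ?",
        (tag, start, end),
    ).fetchall()
    copy, original = Counter(copy_rows), Counter(original_rows)
    # Counter subtraction keeps multiplicity, so a duplicated row also shows up.
    return sorted(((copy - original) + (original - copy)).elements())


conn = sqlite3.connect(":memory:")
# Like the shipped historian, this table has no unique constraint on (tag, ts).
conn.execute("CREATE TABLE oss_copy (tag TEXT, ts TEXT, value REAL)")

# Made-up values standing in for what a read from the PI original returned.
original = [
    ("2024-01-01T00:00:00Z", 1.10),
    ("2024-01-01T01:00:00Z", 1.25),
    ("2024-01-01T02:00:00Z", 1.40),
]
window = ("BR101.Titer.PV", "2024-01-01T00:00:00Z", "2024-01-01T03:00:00Z")

backfill_window(conn, *window, original)
backfill_window(conn, *window, original)  # repeated on purpose
row_count = conn.execute("SELECT COUNT(*) FROM oss_copy").fetchone()[0]
mismatches = reconcile(conn, *window, original)
assert mismatches == [], "OSS copy has drifted from the original"
```

Running the backfill twice leaves the row count unchanged because the window is cleared before each insert, and the final assertion is the alarm the paragraph above asks for: it fails loudly the moment the copy and the original disagree.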

Getting it right is liberating. The validated PI System keeps doing its job — original record, full audit trail, the thing the quality unit signs off [5]. Your open-source layer gets fast, cheap, unconstrained access to true copies for SPC, multivariate models, and soft sensors, and can hand computed insight back through a controlled door. GAMP 5's second edition is explicit that open source belongs in GxP — inside a validated lifecycle, with supplier and risk assessment proportionate to use [12]. The bridge is the seam where that lifecycle is drawn.

In the real world

Walk into almost any approved-product mAb plant and you will find a PI System at the center, surrounded by DCS (distributed control system) collectors feeding it and analytics tools reading from it. The pattern in this chapter is not a teaching simplification; it is how integration teams actually live. The fed-batch CHO + Protein A line we model is the dominant modality for approved monoclonal antibodies, and its data has been landing in PI historians for two decades.

The intensified/continuous variant of our process — perfusion with multi-column capture — only sharpens the point: a perfusion run holds a high steady-state viable-cell density and adds continuous tags the batch never has — PBR201.CSPR.PV, PBR201.Perfusion.Rate, PBR201.CellBleed.Rate, PBR201.Harvest.VCD, PBR201.Harvest.Titer (all in datasets/perfusion_timeseries.parquet). More sensors, more tags, more reason to want cheap open-source analytics, and exactly the same reason to leave the validated historian alone.

The honest OSS-vs-commercial verdict for this layer: open source gets you a superb, scalable historian (TimescaleDB), world-class dashboards (Grafana), and a free, flexible bridge (Telegraf [9], NiFi [11], a Python PI Web API client). What it does not give you is a turnkey, validated, Part-11-ready system of record with a vendor on the hook when an inspector calls. PI gives you that. So the lower-risk, compliant move is the hybrid: PI keeps the original; OSS does the thinking [3][2]. Pure open source gets you about 80% of the platform; this is one of the seams where the last GxP mile is, honestly, hybrid.

Key terms

Historian / record-of-truth — the system that holds the official, audit-trailed process record. In most plants this is a validated AVEVA/OSIsoft PI System, not the OSS historian.
PI Web API — AVEVA's RESTful, HTTPS read/write interface to PI Data Archive and Asset Framework data [6].
PI Asset Framework (AF) — the asset-centric layer that contextualizes raw PI tags into hierarchies, attributes, and units [1].
WebId — the opaque identifier PI Web API uses to address a point, stream, or AF element. In the shipped stub it is base64url(tag) with padding stripped, so BR101.Titer.PV resolves to QlIxMDEuVGl0ZXIuUFY; a real PI mints a longer server token, but the resolve-then-read contract is the same.
PI Point — the raw, compressed time-series tag in the PI Data Archive (e.g. BR101.Titer.PV). An AF Attribute contextualizes it (Element → Attribute → PI Point); bridging against the attribute survives a tag or unit rename, bridging against the bare point does not.
Recorded-values Item — one element of the Items array a PI Web API recorded-values read returns: Timestamp, Value, UnitsAbbreviation, and the three quality booleans Good/Questionable/Substituted. The unit of mapping into a single ts.sensor_reading row.
OPC UA Historical Access (HA) — OPC UA Part 11; reading and writing historical values and events, the basis for backfill [10].
Backfill — filling a gap in the OSS copy by reading the missing window from the PI original after an outage.
Reconciliation — comparing the two copies over an overlapping window and flagging divergence rather than silently overwriting.
Idempotent — an operation safe to repeat; against the shipped historian (which has no unique constraint on (tag, ts)), a bridge achieves it by deleting the affected window and re-inserting, so re-reading never duplicates rows.
True copy vs original record — MHRA terms; the OSS layer holds true copies/derived data, PI retains the original [5].
CSA (Computer Software Assurance) — FDA's risk-based, least-burdensome assurance approach that lets you apply lighter rigor to the OSS analytics layer [3].
cGMP — current Good Manufacturing Practice; the binding expectation of controlled, documented, reproducible drug manufacturing.
GxP — "Good x Practice": the umbrella for the regulated good-practice rules (GMP, GLP, GDP, …), of which cGMP is the manufacturing slice; a "GxP record" is one those rules govern.
21 CFR Part 11 — FDA's rule for trustworthy electronic records and signatures; the validation, audit-trail, and assurance burden an OSS historian does not carry out of the box. Distinct from the unrelated OPC UA Part 11 (Historical Access) above.
SPC — Statistical Process Control; charting a process variable over time to catch drift.
ALCOA+ — the data-integrity attributes inspectors apply to a record: Attributable, Legible, Contemporaneous, Original, Accurate, plus Complete, Consistent, Enduring, and Available.
PROV-O / provenance edge — the W3C provenance ontology; a prov:wasDerivedFrom triple records that the OSS copy was derived from the PI original, turning the MHRA "true copy vs original record" distinction into a machine-checkable fact rather than prose.
Grouped (leave-one-batch-out) split — the validation discipline of putting every reading from a batch wholly on one side of the train/test line, so a soft sensor is scored on genuinely unseen lots; the bridge's batch_id column is the grouping key, and a row-wise random split reports a fantasy score.
Applicability domain — a gate that asks whether a new reading resembles the model's training data, so a soft sensor can decline to guess out of its depth (e.g. a fed-batch model facing a perfusion regime) rather than write a confident wrong value back into the record.
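The stub WebId rule in the list above (base64url of the tag, padding stripped) is small enough to check by hand. The following sketch assumes the tag is encoded as UTF-8 before base64url encoding; the function names `stub_webid` and `stub_tag` are hypothetical, and none of this applies to a real PI server, which mints its own opaque tokens.

```python
import base64


def stub_webid(tag: str) -> str:
    """base64url(tag) with '=' padding stripped, as the stub is described."""
    encoded = base64.urlsafe_b64encode(tag.encode("utf-8"))
    return encoded.rstrip(b"=").decode("ascii")


def stub_tag(webid: str) -> str:
    """Inverse of stub_webid: restore the padding, then decode."""
    padded = webid + "=" * (-len(webid) % 4)
    return base64.urlsafe_b64decode(padded).decode("utf-8")


webid = stub_webid("BR101.Titer.PV")
print(webid)            # QlIxMDEuVGl0ZXIuUFY
print(stub_tag(webid))  # BR101.Titer.PV
```

Because the stub WebId is reversible, it is handy in tests; code written against it should still treat the value as opaque so that it keeps working against a real server.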

Where this leads

The historian was the friendliest commercial system to bridge, because it speaks open protocols and trades in time-series we already understand. The next chapter, Bridging to DCS, MES & ERP: DeltaV, Siemens, SAP, walks into harder territory — control systems over OPC UA, the blunt verdict that there is no credible open-source GxP MES, and ERP exchange of materials, lots, and work orders via B2MML/ISA-95 messages — where the honest-hybrid boundary is drawn not by preference but by necessity.
