Bridging to Commercial Historians: AVEVA/OSIsoft PI
Where we are: Part IV · Meeting Reality. We leave the all-open-source comfort zone and wire our stack to the one system most plants will never let us replace: the validated commercial historian.
Think of the plant's PI System as the official, notarized ledger in the bank vault. We are not allowed to throw it out, and frankly we should not want to: auditors trust it, it has been validated, and the whole site already reads from it. What we are allowed to do is build a fast, friendly photocopier next to the vault: it makes faithful copies of the ledger for our analytics, and when we compute something new, it can hand a copy back through the proper window. This chapter builds that photocopier, the clean boundary between our open-source stack and the commercial historian, and is honest about which side holds the original.
What this chapter covers
For sixteen chapters we built a complete open-source data platform: an OPC UA bioreactor, a Sparkplug bus, a TimescaleDB historian, an ISA-88/95 model in PostgreSQL, contextualization views, and a knowledge graph. It all runs on a laptop. So why does this chapter exist?
Because in a real biopharmaceutical plant, the historian of record is almost never the one you just built. It is a validated AVEVA PI System (the product formerly branded OSIsoft PI), and it is the record-of-truth for years of GMP process data. The honest-hybrid story of this book begins here. This chapter covers:
- Why the GMP historian record stays commercial, and why that is the lower-risk choice, not a failure of open source.
- The two boundaries you actually engineer: PI Web API (REST) and OPC UA.
- A backfill and reconciliation pattern so the OSS copy and the PI original agree.
- How to develop and test the bridge against a PI stub when no real PI server is in reach.
cGMP, by the way, is current Good Manufacturing Practice: the FDA's binding expectation that drug manufacturing is controlled, documented, and reproducible. Keep that word in mind; it is the reason the vault stays locked.
Why the record-of-truth stays commercial
A PI System is not just a database. It is a PI Data Archive of compressed time-series tags, plus PI Asset Framework (AF), an asset-centric layer that contextualizes raw tags into hierarchies, attributes, units, and templates [1]. A site has typically spent years validating it: installation qualification, operational qualification, SOPs for change control, a reviewed audit trail. Under quality risk management, ripping that out to chase an open-source historian would be a high-effort, high-risk move for little patient benefit. ICH Q9(R1) is explicit that the effort, formality, and documentation of risk management must be commensurate with the level of risk [2], and the risk of replacing a validated record system is large.
The FDA's Computer Software Assurance (CSA) guidance points the same way. It encourages a risk-based, least-burdensome approach: spend your assurance effort where the patient risk is, and lean on existing controls elsewhere [3]. That gives us a clean division of labor. The validated PI System carries the high-rigor GxP record. Our open-source layer (TimescaleDB, Grafana, the soft-sensor) is the lower-risk analytics and edge complement. It computes, visualizes, and explores; it does not have to be the original record.
This is exactly the architecture the digital-twin literature describes: the data historian is the integration hub of record, and process data is replicated from it out to cloud and analytics layers over OPC and TCP/IP transports [4]. We are not inventing a pattern. We are implementing the established one, in open source, with our eyes open about who owns the original.
The data-integrity vocabulary makes the boundary precise. MHRA's GXP guidance distinguishes the original record from a true copy [5]. The validated PI System retains the original GxP record, with its full audit trail and dynamic data. Our OSS layer holds true copies and derived data. Say that out loud in a design review and a quality unit will relax: you are not asking them to trust open source with the original.
The OSS-PI boundary as engineering, not ideology: PI keeps the original GxP record; the open-source layer holds true copies and writes derived values back through a controlled door. Original diagram by the authors, created with AI assistance.
The two boundaries: PI Web API and OPC UA
There are exactly two doors into a modern PI System, and you will use both.
Door one: PI Web API. This is a RESTful interface giving client applications read and write access to PI Data Archive and AF data over HTTPS [6]. It speaks JSON, it is firewall-friendly, and it is the natural fit for a Python client or an Apache NiFi flow. You request a stream's recorded values over a time window; you POST new values to a point.
Door two: OPC UA. PI ingests from OPC UA servers through the PI Connector/Adapter for OPC UA, which makes OPC UA a first-class read/write boundary between PI and the edge [7]. Because OPC UA is platform-independent, secure, and firewall-friendly [8], and because our bioreactor already speaks it, this door costs us almost nothing: our existing opcua-server and Telegraf's OPC UA input plugin [9] plug straight in. The subtle, important part is history: OPC UA Part 11, Historical Access, defines how a client reads and writes historical data and events, not just live values [10]. That is what makes gap-filling against PI possible, and it is the backbone of the reconciliation pattern below.
A rule of thumb from the field: use OPC UA when PI should pull live data from us (PI stays the collector, the validated path is unchanged), and use PI Web API when we pull from PI for analytics or write a computed value back. The first keeps PI's ingestion validated and untouched; the second keeps our analytics loop fast.
Talking to PI Web API: the request shape
Since AVEVA PI cannot run on a laptop and ships no public image, the honest-hybrid design is to develop against a mock rather than the real thing, the same approach the book takes for SAP and DeltaV. That mock is now shipped: a small FastAPI service at examples/services/pi-web-api-stub/ that honors the same request and response contract a real PI Web API does, serves our golden batch, and accepts writes. The companion repo's compose.yaml runs it under a commercial profile (alongside the existing core, semantics, and analytics/ops profiles); only the still-unbuilt dcs-mock and sap-mock from later chapters remain on the roadmap. So treat the PI snippets below as the request/response contract (illustrative shapes, clearly labelled), while the stub that serves them and the bridge that parses them are real, tested code (examples/chapters/17-bridge-pi-historian/pi_bridge.py, exercised by examples/tests/test_bridges.py). The loader on our side of the boundary, shown later in this chapter, is real too.
PI Web API addresses everything by an opaque WebId. You resolve a tag (a "PI Point") to its WebId once, then read its recorded values over a window. A recorded-values response would look like this (illustrative PI Web API JSON matching AVEVA's documented schema, the shape your mock would serve and your bridge would parse):
{
  "Items": [
    { "Timestamp": "2026-01-18T23:58:00Z", "Value": 5.8214, "UnitsAbbreviation": "g/L",
      "Good": true, "Questionable": false, "Substituted": false },
    { "Timestamp": "2026-01-18T23:59:00Z", "Value": 5.7589, "UnitsAbbreviation": "g/L",
      "Good": true, "Questionable": false, "Substituted": false }
  ]
}
Two things to notice. First, those titer values (5.8214 and 5.7589 g/L) are the exact last two readings of BR101.Titer.PV for our golden batch BATCH-2026-001, taken from the 1-minute trace in datasets/fedbatch_timeseries.parquet (the very file the loader below reads), which ends at 23:59. PI and our stack are looking at the same physical bioreactor; they had better agree to the digit. Second, PI Web API surfaces quality as Good/Questionable/Substituted flags rather than a single numeric OPC quality code. Our historian stores OPC-style quality (192 for Good, 64 for Uncertain). Bridging the two is a small mapping table you write once and validate, never an afterthought.
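The mapping and the JSON-to-row step can be sketched in a few lines. This is a minimal illustration, not the code in pi_bridge.py: the function names are hypothetical, treating Substituted as Uncertain is an assumed policy that a site would have to decide and validate for itself, and 0 for Bad is the conventional OPC code rather than a value stated in this chapter.

```python
from datetime import datetime

# OPC-style quality codes. 192 and 64 come from the text above;
# 0 for Bad is the conventional OPC value (an assumption here).
OPC_GOOD, OPC_UNCERTAIN, OPC_BAD = 192, 64, 0


def pi_quality_to_opc(item: dict) -> int:
    """Collapse PI's three quality flags into one OPC-style code.

    Assumed policy: Questionable or Substituted maps to Uncertain.
    """
    if item.get("Questionable") or item.get("Substituted"):
        return OPC_UNCERTAIN
    return OPC_GOOD if item.get("Good") else OPC_BAD


def items_to_rows(payload: dict, tag: str, batch_id: str) -> list[tuple]:
    """Turn a recorded-values response into historian rows.

    Each row is (ts, tag, value, unit, quality, batch_id).
    """
    rows = []
    for item in payload["Items"]:
        # Replace the trailing 'Z' with an explicit offset so that
        # datetime.fromisoformat also accepts it on older Python versions.
        ts = datetime.fromisoformat(item["Timestamp"].replace("Z", "+00:00"))
        rows.append((ts, tag, float(item["Value"]),
                     item.get("UnitsAbbreviation"),
                     pi_quality_to_opc(item), batch_id))
    return rows
```

The sketch assumes every Value is numeric. A real server may return a non-numeric Value for a bad point, so a production bridge would need an explicit branch for that case rather than letting float() raise.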
Writing back is the same door in reverse: a POST to a point's recorded-values endpoint with a JSON body of {Timestamp, Value} items. This is how a computed tag (say BR101.Titer.SoftSensor from the Chapter 26 Raman model) re-enters the PI world so operators see it next to the instrument tags they trust.
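A write-back request might be assembled as follows. This is a sketch under assumptions: build_writeback is a hypothetical helper, the base URL and WebId in the usage note are placeholders, the path reuses the /streams/{webId}/recorded shape from the read contract later in this chapter, and authentication is left out entirely. The function only builds the request; nothing is sent.

```python
import json
import urllib.request
from datetime import datetime, timezone


def build_writeback(base_url: str, web_id: str,
                    points: list[tuple[datetime, float]]) -> urllib.request.Request:
    """Build (but do not send) a POST of {Timestamp, Value} items."""
    body = [
        {
            # Normalize to UTC and use the same 'Z' form the read path returns.
            "Timestamp": ts.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
            "Value": value,
        }
        for ts, value in points
    ]
    return urllib.request.Request(
        url=f"{base_url}/streams/{web_id}/recorded",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Calling build_writeback("https://pi.example.invalid/piwebapi", "WEBID-PLACEHOLDER", points) returns a Request whose body can be inspected in a unit test before any credential or network is involved, which keeps the write path testable against the stub.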
Where the OSS copy lands: load it like any other source
On our side of the boundary, a value pulled from PI is just another time-series row. It lands in the same ts.sensor_reading hypertable that every capture chapter writes to, with the same six columns. The companion repo's loader shows the exact shape and the exact COPY path; this is from examples/tools/load_datasets.py:
def load_timeseries(conn) -> int:
    df = pd.read_parquet(DATA / "fedbatch_timeseries.parquet")
    buf = io.StringIO()
    df[["ts", "tag", "value", "unit", "quality", "batch_id"]].to_csv(buf, index=False, header=False)
    buf.seek(0)
    with conn.cursor() as cur:
        cur.execute("TRUNCATE ts.sensor_reading")
        with cur.copy("COPY ts.sensor_reading (ts, tag, value, unit, quality, batch_id) "
                      "FROM STDIN WITH (FORMAT csv)") as copy:
            copy.write(buf.read())
    return len(df)
The columns are the contract: ts, tag, value, unit, quality, batch_id. A PI bridge does nothing more exotic than this: it turns PI Web API JSON into rows of exactly this shape and COPYs them in. The header of that same file states the design plainly:
This is the path Chapters 5-13 build up piece by piece; here it is one script so
the contextualization and ALCOA+ chapters have data to query. Idempotent: it
truncates the loaded tables first.
The word idempotent is the whole game for a bridge. Re-running the loader leaves the table in the same state, because it truncates first. A production PI bridge cannot truncate the world every run, so it earns idempotency a different way. Note one constraint the shipped schema imposes: platform/db/20-historian.sql builds ts.sensor_reading as a TimescaleDB hypertable with only non-unique indexes on (tag, ts DESC) and (batch_id, ts DESC); there is no primary key or unique constraint on (tag, ts). So a naive INSERT ... ON CONFLICT (tag, ts) would fail against the repo as built. The mechanism the current schema does support is delete-window-then-insert: inside one transaction, DELETE the rows for the tags and time window you are about to backfill, then INSERT the freshly read values. Re-reading an overlapping window then never duplicates a row. (If you want a true upsert, you would first add a UNIQUE (tag, ts) index, and on a hypertable that index must include the partitioning column ts.) Same goal, different mechanism. The CSV the loader consumes is the same long format you saw the simulator emit (these are the last two 1-minute rows of the golden batch, the tail of fedbatch_timeseries.parquet):
ts,tag,value,unit,quality,batch_id
2026-01-18 23:58:00+00:00,BR101.Titer.PV,5.8214,g/L,192,BATCH-2026-001
2026-01-18 23:59:00+00:00,BR101.Titer.PV,5.7589,g/L,192,BATCH-2026-001
Notice the round-trip closes: the PI Web API JSON above and this CSV row describe the same measurement. The bridge's only job is to keep that true.
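The delete-window-then-insert mechanism can be shown end to end in a few lines. To keep the sketch runnable anywhere, it uses an in-memory SQLite table as a stand-in for ts.sensor_reading; that substitution, the unqualified table name, and the replace_window helper are assumptions for illustration. Against the real historian the same two statements would run through the PostgreSQL driver inside one transaction.

```python
import sqlite3


def replace_window(conn, tag, start, end, rows):
    """Delete [start, end] for one tag, then insert the fresh rows.

    Both statements run in one transaction: `with conn` commits on
    success and rolls back if either statement raises.
    """
    with conn:
        conn.execute(
            "DELETE FROM sensor_reading WHERE tag = ? AND ts BETWEEN ? AND ?",
            (tag, start, end),
        )
        conn.executemany(
            "INSERT INTO sensor_reading (ts, tag, value, unit, quality, batch_id) "
            "VALUES (?, ?, ?, ?, ?, ?)",
            rows,
        )


conn = sqlite3.connect(":memory:")
# No primary key or unique constraint, mirroring the shipped hypertable.
conn.execute(
    "CREATE TABLE sensor_reading "
    "(ts TEXT, tag TEXT, value REAL, unit TEXT, quality INTEGER, batch_id TEXT)"
)

rows = [
    ("2026-01-18 23:58:00+00:00", "BR101.Titer.PV", 5.8214, "g/L", 192, "BATCH-2026-001"),
    ("2026-01-18 23:59:00+00:00", "BR101.Titer.PV", 5.7589, "g/L", 192, "BATCH-2026-001"),
]

# Pull the same overlapping window twice; the table must not grow.
for _ in range(2):
    replace_window(conn, "BR101.Titer.PV", rows[0][0], rows[-1][0], rows)

count = conn.execute("SELECT count(*) FROM sensor_reading").fetchone()[0]
print(count)  # 2
```

The second pass deletes the two rows the first pass wrote and inserts them again, so the row count stays at two. A plain INSERT without the DELETE would have left four.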
Backfill and reconciliation: making the two copies agree
Live streaming is the easy 90%. The hard, GxP-relevant 10% is what happens after the network was down for two hours. The OSS copy now has a gap the PI original does not, because PI's validated collector buffered through the outage. This is precisely why OPC UA Historical Access matters [10]: the bridge can ask PI for the historical values across the missing window and fill the hole.
A robust backfill loop has four moves:
- Find the gap. Query our historian for the latest ts per tag; anything newer in PI is missing here.
- Read history. Pull PI Web API recorded values (or OPC UA HA reads) for [last_ours, now].
- Replace the window, do not blindly append. In one transaction, DELETE the rows for the affected tags and [last_ours, now] window, then INSERT the freshly read values, so an overlapping window is safe to re-pull without duplicating rows. (Switch to INSERT ... ON CONFLICT only after you add a UNIQUE (tag, ts) index, which the shipped hypertable does not yet carry.)
- Reconcile. Re-compare a sample of overlapping points and assert they match within tolerance; log any divergence as a data-integrity event, not a silent overwrite.
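The four moves can be traced in a small, storage-free sketch. Everything here is a stand-in chosen for illustration: plain dictionaries play the role of both historians, backfill_tag and read_pi are hypothetical names, and the function assumes our copy already holds at least one point for the tag. One deliberate choice differs from the order of the list: the overlap is compared before the window is replaced, so a disagreement is reported rather than overwritten unnoticed.

```python
from datetime import datetime, timezone


def backfill_tag(tag, ours, read_pi, now, tolerance=1e-6):
    """Run the four moves for one tag and return any divergent points.

    ours    -- dict of timestamp -> value (stand-in for our historian)
    read_pi -- callable (tag, start, end) -> dict of timestamp -> value
    """
    # 1. Find the gap: our newest timestamp marks where the copy stops.
    last_ours = max(ours)

    # 2. Read history for the closed window [last_ours, now].
    fresh = read_pi(tag, last_ours, now)

    # 4 (checked early). Reconcile the overlap before replacing anything.
    divergent = [
        (ts, ours[ts], value)
        for ts, value in fresh.items()
        if ts in ours and abs(ours[ts] - value) > tolerance
    ]

    # 3. Replace the window: drop our rows inside it, then take PI's.
    for ts in [t for t in ours if last_ours <= t <= now]:
        del ours[ts]
    ours.update(fresh)
    return divergent


# Demo with the two golden-batch points quoted in this chapter.
t58 = datetime(2026, 1, 18, 23, 58, tzinfo=timezone.utc)
t59 = datetime(2026, 1, 18, 23, 59, tzinfo=timezone.utc)
pi_archive = {t58: 5.8214, t59: 5.7589}  # PI buffered through the outage
ours = {t58: 5.8214}                     # our copy stopped one minute early


def read_pi(tag, start, end):
    """Stand-in for a recorded-values read against PI."""
    return {ts: v for ts, v in pi_archive.items() if start <= ts <= end}


events = backfill_tag("BR101.Titer.PV", ours, read_pi, now=t59)
```

After the call, ours holds both points and events is empty. Running it a second time changes nothing, which is the idempotency property the third move is there to guarantee. In a real bridge each entry in events would be written to a data-integrity log for review.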
Apache NiFi is the natural home for this when you want a visual, replayable flow with provenance: its InvokeHTTP processor is an HTTP client that calls a configurable endpoint and sends a FlowFile body as the request, supporting GET for reads and PUT/POST/PATCH for writes [11], exactly the verbs PI Web API uses. NiFi's provenance then records who pulled what, when, and from where, which is gold during an investigation. The reconciliation step itself is plain SQL against the two copies:
-- Reconcile the OSS copy against PI for one tag/window.
-- Flags any point where the copies disagree by more than tolerance.
SELECT o.ts, o.value AS oss_value, p.value AS pi_value,
abs(o.value - p.value) AS delta
FROM ts.sensor_reading o
JOIN stage.pi_recorded p ON p.tag = o.tag AND p.ts = o.ts
WHERE o.tag = 'BR101.Titer.PV'
AND o.batch_id = 'BATCH-2026-001'
AND abs(o.value - p.value) > 1e-6; -- expect zero rows
If that query returns rows, your copy drifted: perhaps a unit-conversion bug, perhaps a clock skew. Returning zero rows is the assertion your test suite makes, and it is the engineering proof that the true copy is faithful to the original.
Testing the bridge with no PI in reach
You will write almost all of this code with no PI server anywhere near you, and that is fine, if you test against a contract rather than a vibe. The honest-hybrid design, now shipped in the repo, is the pi-web-api-stub that exposes the PIPoint, Stream, and AF endpoints a real PI exposes, serves the golden data, and accepts writes. Pin it to an OpenAPI spec and a contract test; a fuzzing tool such as schemathesis can then exercise both the stub and the bridge against the same schema a real PI honors. The point the book makes plainly: the goal is a bridge that is real and tested against a contract, with the PI counterpart mocked. When you reach a site with a real PI, you change a base URL and a credential, run the same contract test against the real server, and you are done. The artifacts in this chapter that run today are the bridge (examples/chapters/17-bridge-pi-historian/pi_bridge.py) and its test against the stub (examples/tests/test_bridges.py), plus the loader above (examples/tools/load_datasets.py).
A minimal contract for the read path (illustrative, the shape the shipped stub's OpenAPI pins):
# pi-web-api-stub: read path contract (illustrative)
paths:
  /piwebapi/streams/{webId}/recorded:
    get:
      parameters:
        - { name: webId, in: path, required: true, schema: { type: string } }
        - { name: startTime, in: query, schema: { type: string } }  # e.g. "*-2h"
        - { name: endTime, in: query, schema: { type: string } }    # e.g. "*"
      responses:
        "200": { description: Recorded values, content: { application/json: {} } }
This is the honest core of the whole chapter: we cannot run AVEVA PI on a laptop, so we do not pretend to. We pin the contract, test ruthlessly against it, and document the one-line swap to reality.
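A contract test does not need heavy tooling to get started. The sketch below uses only the standard library: recorded_url builds the read request from the contract's two query parameters, and contract_violations checks a response against the recorded-values shape shown earlier in this chapter. Both function names and the example host are hypothetical, and the check covers only the fields this chapter names; it is a starting point that complements an OpenAPI-driven tool.

```python
from urllib.parse import quote, urlencode


def recorded_url(base_url: str, web_id: str,
                 start_time: str = "*-2h", end_time: str = "*") -> str:
    """Build the read-path URL with startTime and endTime query parameters."""
    query = urlencode({"startTime": start_time, "endTime": end_time})
    return f"{base_url}/streams/{quote(web_id, safe='')}/recorded?{query}"


def contract_violations(payload: dict) -> list[str]:
    """Return a list of ways the payload breaks the recorded-values shape."""
    items = payload.get("Items")
    if not isinstance(items, list):
        return ["Items must be a list"]

    problems = []
    for i, item in enumerate(items):
        if not isinstance(item.get("Timestamp"), str):
            problems.append(f"Items[{i}].Timestamp must be a string")
        value = item.get("Value")
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            problems.append(f"Items[{i}].Value must be a number")
        for flag in ("Good", "Questionable", "Substituted"):
            if not isinstance(item.get(flag), bool):
                problems.append(f"Items[{i}].{flag} must be a boolean")
    return problems


url = recorded_url("https://pi.example.invalid/piwebapi", "WEBID-PLACEHOLDER")
sample = {"Items": [{"Timestamp": "2026-01-18T23:59:00Z", "Value": 5.7589,
                     "UnitsAbbreviation": "g/L", "Good": True,
                     "Questionable": False, "Substituted": False}]}
problems = contract_violations(sample)  # an empty list means the shape holds
```

Because the same two functions run unchanged against the stub and against a real server, the swap to reality stays the base-URL-and-credential change described above.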
Why it matters
Get this boundary wrong and you face one of two failures. Either you try to make the open-source historian the GMP record, and inherit a validation, audit-trail, and Part-11 burden that no OSS historian carries out of the box, for no patient benefit. Or you let the two copies silently diverge, and your shiny Grafana dashboard quietly disagrees with the record auditors will actually read.
Getting it right is liberating. The validated PI System keeps doing its job: original record, full audit trail, the thing the quality unit signs off [5]. Your open-source layer gets fast, cheap, unconstrained access to true copies for SPC, multivariate models, and soft sensors, and can hand computed insight back through a controlled door. GAMP 5's second edition is explicit that open source belongs in GxP, inside a validated lifecycle, with supplier and risk assessment proportionate to use [12]. The bridge is the seam where that lifecycle is drawn.
In the real world
Walk into almost any approved-product mAb plant and you will find a PI System at the center, surrounded by DCS collectors feeding it and analytics tools reading from it. The pattern in this chapter is not a teaching simplification; it is how integration teams actually live. The fed-batch CHO + Protein A line we model is the dominant modality for approved monoclonal antibodies, and its data has been landing in PI historians for two decades.
NIIMBL, the U.S. public-private Institute for the advancement of biopharmaceutical manufacturing, funds exactly this kind of interoperability work, and its SABRE facility (a pilot-scale cGMP facility at the University of Delaware, which broke ground in April 2024 and is still under construction as of mid-2026) is the sort of site where an OSS analytics layer would sit beside validated commercial systems, not replace them. The intensified/continuous variant of our process (perfusion with multi-column capture) only sharpens the point: more sensors, more tags, more reason to want cheap open-source analytics, and exactly the same reason to leave the validated historian alone.
The honest OSS-vs-commercial verdict for this layer: open source gets you a superb, scalable historian (TimescaleDB), world-class dashboards (Grafana), and a free, flexible bridge (Telegraf [9], NiFi [11], a Python PI Web API client). What it does not give you is a turnkey, validated, Part-11-ready system of record with a vendor on the hook when an inspector calls. PI gives you that. So the lower-risk, compliant move is the hybrid: PI keeps the original; OSS does the thinking [3][2]. Pure open source gets you about 80% of the platform; this is one of the seams where the last GxP mile is, honestly, hybrid.
Key terms
- Historian / record-of-truth: the system that holds the official, audit-trailed process record. In most plants this is a validated AVEVA/OSIsoft PI System, not the OSS historian.
- PI Web API: AVEVA's RESTful, HTTPS read/write interface to PI Data Archive and Asset Framework data [6].
- PI Asset Framework (AF): the asset-centric layer that contextualizes raw PI tags into hierarchies, attributes, and units [1].
- WebId: the opaque identifier PI Web API uses to address a point, stream, or AF element.
- OPC UA Historical Access (HA): OPC UA Part 11; reading and writing historical values and events, the basis for backfill [10].
- Backfill: filling a gap in the OSS copy by reading the missing window from the PI original after an outage.
- Reconciliation: comparing the two copies over an overlapping window and flagging divergence rather than silently overwriting.
- Idempotent: an operation safe to repeat; against the shipped historian (which has no unique constraint on (tag, ts)), a bridge achieves it by deleting the affected window and re-inserting, so re-reading never duplicates rows.
- True copy vs original record: MHRA terms; the OSS layer holds true copies/derived data, PI retains the original [5].
- CSA (Computer Software Assurance): FDA's risk-based, least-burdensome assurance approach that lets you apply lighter rigor to the OSS analytics layer [3].
- cGMP: current Good Manufacturing Practice; the binding expectation of controlled, documented, reproducible drug manufacturing.
Where this leads
The historian was the friendliest commercial system to bridge, because it speaks open protocols and trades in time-series we already understand. The next chapter, Bridging to DCS, MES & ERP: DeltaV, Siemens, SAP, walks into harder territory (control systems over OPC UA, the blunt verdict that there is no credible open-source GxP MES, and ERP exchange of materials, lots, and work orders via B2MML/ISA-95 messages), where the honest-hybrid boundary is drawn not by preference but by necessity.