The Open-Source Historian: Choosing and Running a Time-Series Store
๐ Where we are: Part III ยท Storing & Connecting โ Chapter 13. The capture layer is now pushing a river of sensor readings at us; this chapter builds the place those readings live โ an open-source historian โ and is honest about where open source stops and a commercial PI server begins.
A process historian is a tireless clerk whose only job is to write down every number the plant emits, forever, with the time it happened โ and to hand any slice of it back to you in milliseconds. A bioreactor breathes out a reading every few seconds: temperature, pH, dissolved oxygen, titer. Over a 14-day batch that is millions of slips of paper. An ordinary database chokes on that; a historian is built for it. The commercial gold standard is AVEVA PI (formerly OSIsoft PI). In this book we build the open-source equivalent โ and we tell you plainly the two places it is not equivalent: the patented compression PI is famous for, and PI's native quality flag on every value.
What this chapter coversโ
In the capture chapters (5โ12) we wired sensors, edge gateways, and collectors that all funnel readings toward one table. This chapter is where that table actually lives. We will:
- build the historian as a TimescaleDB hypertable inside PostgreSQL, so high-rate sensor data and the relational batch model share one engine;
- pre-roll one-minute and one-hour summaries with continuous aggregates, and bound storage with a retention policy โ being honest that these conveniences are TimescaleDB Community (TSL) features, free to run but source-available rather than OSI open source;
- name the license trap out loud (the genuinely Apache-2.0 core is just hypertables,
create_hypertable,time_bucket, anddrop_chunks; continuous aggregates, retention/CAGG policies, and the Hypercore compression all sit under the source-available TSL) and survey the strictly Apache-2.0 alternatives โ Apache IoTDB, InfluxDB 3 Core, QuestDB; - explain swinging-door compression โ the algorithm every commercial historian uses โ and why a careless deadband can silently corrupt the record;
- and confront the one thing no open-source historian ships out of the box: PI's per-value data-quality flag, and how this repo carries it anyway.
The schema in this chapter lives in examples/platform/db/20-historian.sql and is applied by make seed; the data flowing into it is produced by the deterministic simulator in examples/sim/bioproc_sim/fed_batch.py. Both are real and tested.
One table to hold everythingโ
The heart of the historian is almost insultingly simple. Here is the whole table, from examples/platform/db/20-historian.sql:
CREATE TABLE ts.sensor_reading (
ts timestamptz NOT NULL,
tag text NOT NULL,
value double precision,
unit text,
quality smallint NOT NULL DEFAULT 192, -- OPC UA: 192 Good, 64 Uncertain, 0 Bad
batch_id text
);
Six columns. A timestamp, a tag name, a number, its unit, a quality code, and the batch it belongs to. This long, narrow shape โ one row per reading rather than one column per sensor โ is the defining choice of a historian. A new sensor is a new value of tag, not a schema migration; a thousand tags and one tag cost the same to model. It is the relational mirror of the OPC UA address space we read from in Chapter 7.
The next line is what turns an ordinary Postgres table into a time-series engine:
SELECT create_hypertable('ts.sensor_reading', 'ts', chunk_time_interval => INTERVAL '1 day');
CREATE INDEX ON ts.sensor_reading (tag, ts DESC);
CREATE INDEX ON ts.sensor_reading (batch_id, ts DESC);
A hypertable looks and behaves exactly like one table โ you INSERT and SELECT against ts.sensor_reading as normal โ but underneath, TimescaleDB automatically slices it into chunks partitioned by time, here one chunk per day [1]. That partitioning is why a query for "yesterday's titer" never scans last month: the planner touches only the chunks whose time range overlaps your query. The two indexes mirror the two questions the rest of the book asks โ give me one tag over time and give me one batch over time โ both kept in descending time order because the most-asked question is "what happened recently?".
Why a Postgres extension and not a purpose-built time-series server? Because the bioprocess world is fundamentally a join problem. A temperature reading is meaningless until it is tied to its batch, its phase, its equipment, and its recipe โ all of which live in the relational ISA-88/95 model we built in Chapter 3. Keeping the historian inside the same PostgreSQL instance means that join is a plain SQL join, no cross-system glue, which is exactly what Chapter 14 exploits. We trade some raw ingest throughput for the ability to ask process questions in one query. For a single mAb line that is the right trade.
The data that lands hereโ
The numbers in this table are not hand-typed; they come from the deterministic simulator in examples/sim/bioproc_sim/fed_batch.py, which models a 14-day fed-batch CHO culture โ logistic growth limited by glucose and glutamine, a death phase as nutrients deplete, lactate produced then consumed, and antibody titer accumulating with the integral of viable cells, all under PID-style controllers with bounded sensor noise. It declares its sixteen tags explicitly:
def _tag_specs() -> dict[str, str]:
return {
"BR101.Temp.PV": "degC",
"BR101.Temp.SP": "degC",
"BR101.pH.PV": "pH",
...
"BR101.OnlineGlucose.PV": "g/L",
"BR101.Titer.PV": "g/L",
}
The point worth dwelling on is determinism. The simulator seeds its randomness from one master value (SIM_SEED=2026) hashed with a per-stream label, so the same run produces byte-identical numbers on any machine โ which is what lets the book quote exact values and CI verify them. Running it as a smoke test produces:
$ python -m bioproc_sim.fed_batch
BATCH-2026-001: rows=322560 tags=16
final VCD=18.2e6 viab=64% titer=5.77 g/L
That rows=322560 is sixteen tags times 20,160 minutes โ a fortnight stored at one row per minute. The native acquisition is faster (a real skid emits every few seconds); 1/minute is the cadence the historian persists, a first, honest form of downsampling we will return to. A slice of the long-format stream, taken from examples/datasets/fedbatch_timeseries_10min.csv, is exactly what INSERTs into the table โ note that this committed golden CSV is deliberately thinned to one row every 10 minutes (32,256 rows; a short first-two-hours excerpt is also committed as fedbatch_timeseries_10min.sample.csv) to keep the repo small, so its rows are a downsampled view of the simulator's native 1/minute, 322,560-row stream, not a different dataset:
ts,tag,value,unit,quality,batch_id
2026-01-05 00:00:00+00:00,BR101.Temp.PV,37.0145,degC,192,BATCH-2026-001
2026-01-05 00:00:00+00:00,BR101.DO.PV,40.8224,%sat,192,BATCH-2026-001
2026-01-05 00:00:00+00:00,BR101.pH.PV,7.0511,pH,192,BATCH-2026-001
2026-01-05 00:00:00+00:00,BR101.Titer.PV,-0.0045,g/L,192,BATCH-2026-001
(Yes, titer reads slightly negative at inoculation โ that is measurement noise around a true value of zero, deliberately kept so the data behaves like a real probe, not a textbook.)
The quality flag: the column commercial historians have and most OSS forgetsโ
Look again at that quality column, and at how the simulator fills it. The fed-batch model injects a deliberate fault on day 7 โ a cooling excursion where the temperature setpoint dips half a degree and the dissolved-oxygen probe goes unreliable for three hours:
GOOD, UNCERTAIN, BAD = 192, 64, 0 # OPC UA quality codes
...
if excursion:
# day-7 cooling excursion: setpoint dips 0.5 degC for ~3 h, DO reads uncertain
e0 = int(7 * 24 * 60)
e1 = e0 + 180
temp_sp[e0:e1] = 36.5
temp[e0:e1] = 36.5 + rng.normal(0, 0.05, e1 - e0)
do_uncertain[e0:e1] = True
Those 192 / 64 / 0 numbers are OPC UA Data Access status severities โ Good, Uncertain, Bad โ carried from the source all the way into storage. During the excursion the temperature and DO readings are written with quality = 64, and you can see them in the stored data:
ts,tag,value,unit,quality,batch_id
2026-01-12 00:00:00+00:00,BR101.Temp.PV,36.593,degC,64,BATCH-2026-001
2026-01-12 00:10:00+00:00,BR101.Temp.PV,36.4887,degC,64,BATCH-2026-001
2026-01-12 00:20:00+00:00,BR101.Temp.PV,36.468,degC,64,BATCH-2026-001
This matters more than it looks. A value of 36.47 ยฐC that is Good and a value of 36.47 ยฐC that is Uncertain are different facts about the world, and conflating them is a data-integrity failure โ the "A" for Accurate in ALCOA+ depends on a reading carrying its own trustworthiness. Commercial PI has carried a quality/substituted-data flag on every point for decades. Here is the honest open-source reality: none of the open historians ships PI's native quality model โ not TimescaleDB, not IoTDB, not InfluxDB, not QuestDB. So this repo does the obvious thing PI does for you: it makes quality a first-class, NOT NULL column with an explicit default of 192 (Good) and lets the collector down-rate it. That is a small piece of design discipline, not a feature you download โ and it is the kind of gap this book exists to name.
The open-source historian: one long-narrow hypertable auto-partitioned into daily chunks, fed by sixteen tags, pre-rolled into 1-minute and 1-hour continuous aggregates, bounded by a retention policy, with the OPC UA quality flag carried on every row. Original diagram by the authors, created with AI assistance.
Pre-rolling the data: continuous aggregatesโ
A dashboard does not want 20,160 raw points to draw a 14-day trend on a 1,200-pixel screen โ it wants a summary. Computing avg/min/max over the raw table on every refresh is wasteful when the answer barely changes. TimescaleDB's continuous aggregates solve this: they are materialized views over a hypertable that refresh incrementally as new data lands, so you never recompute history. From examples/platform/db/20-historian.sql:
-- 1-minute rollup (avg/min/max/last) as a continuous aggregate
CREATE MATERIALIZED VIEW ts.sensor_1m
WITH (timescaledb.continuous) AS
SELECT time_bucket('1 minute', ts) AS bucket,
tag,
avg(value) AS avg_value,
min(value) AS min_value,
max(value) AS max_value,
last(value, ts) AS last_value
FROM ts.sensor_reading
GROUP BY bucket, tag
WITH NO DATA;
time_bucket is the time-series analogue of GROUP BY: it rounds each timestamp down to its one-minute slot so every reading falls into a bucket. The last(value, ts) aggregate โ picking the most recent value in each bucket, not just the average โ is the one a process engineer reaches for constantly and that plain SQL makes awkward. A second view, ts.sensor_1h, rolls the same data up to hourly granularity for long-range trending. Crucially we keep min and max alongside avg: a one-minute average would have smoothed away a brief spike, but the max column preserves it. That is the difference between a summary you can trust for a deviation investigation and one that quietly hides the evidence.
The aggregates do not refresh themselves by magic; a policy schedules it:
SELECT add_continuous_aggregate_policy('ts.sensor_1m',
start_offset => INTERVAL '3 days', end_offset => INTERVAL '1 minute',
schedule_interval => INTERVAL '1 hour');
Read that as: every hour, refresh the one-minute rollups for data between three days ago and one minute ago โ recent enough to be useful, far enough back that late-arriving readings have settled.
One honest caveat, developed fully in the license section below but flagged here so it is not a surprise: continuous aggregates (CREATE MATERIALIZED VIEW โฆ WITH (timescaledb.continuous)) and the background-job scheduler that drives add_continuous_aggregate_policy are TimescaleDB Community (TSL) features, not Apache-2.0 ones. They are free to run, but they are source-available, not OSI open source โ the same caveat that applies to Hypercore compression. If you must stay strictly Apache-2.0, the equivalent is a plain CREATE MATERIALIZED VIEW refreshed on an external cron schedule; you lose the incremental, never-recompute-history behaviour but keep a clean licence.
Retention: keeping the right amount, for the right timeโ
A historian that never forgets eventually fills the disk. The opposite mistake โ forgetting a record the law requires you to keep โ is worse. TimescaleDB lets you express retention declaratively, dropping whole aged chunks rather than deleting rows one by one:
-- keep raw readings for 400 days (multi-jurisdiction retention is set per region
-- in Chapter 23; this is a safe default longer than any single chapter needs).
SELECT add_retention_policy('ts.sensor_reading', INTERVAL '400 days');
A word on what is and is not open source here, because it is exactly the kind of detail this book refuses to gloss over. The manual primitive โ calling drop_chunks('ts.sensor_reading', older_than => INTERVAL '400 days') yourself, on your own schedule โ is Apache-2.0. The declarative add_retention_policy above, which registers a background job so you never have to remember, is a TimescaleDB Community (TSL) feature [2], riding the same job scheduler as the continuous-aggregate policy. It is free to run but source-available, not OSI open source. A strictly Apache-2.0 stack would replace this one line with a cron job that calls drop_chunks directly. Either way the mechanism is the same: dropping a chunk is cheap because it is a partition-level operation; deleting 322,560 rows with a WHERE clause is not.
But the number 400 days is not arbitrary, and this is where regulation, not engineering, sets the dial. U.S. cGMP requires batch records be retained at least one year past the batch's expiration date, with retained electronic records or true copies kept readily retrievable [3]. The EU's Annex 11 goes further on the medium: stored and archived data must be secured and periodically checked for accessibility, readability, and integrity across the entire retention period โ you cannot just keep the bytes, you must keep them readable [4]. So 400 days is a deliberately conservative single-instance default; the real, per-jurisdiction retention matrix is data the platform loads in Chapter 23, because a global manufacturer keeps the same data for different lengths of time depending on where it was made.
The license trap, said plainlyโ
Here is the honesty this book promised, and it is sharper than the convenient story. TimescaleDB is dual-licensed, but the Apache-2.0 line falls in a less generous place than most write-ups admit. The genuinely Apache-2.0 core is small: hypertables and create_hypertable, the time_bucket function, the first/last aggregates, and manual chunk management via drop_chunks [5]. A large set of the convenient features lives in a tsl/ directory governed by the source-available Timescale License (TSL), which is not an open-source license under the OSI definition. Crucially, three things this chapter leans on are TSL, not Apache-2.0: continuous aggregates (the WITH (timescaledb.continuous) materialized views), declarative retention (add_retention_policy), and the background-job scheduler behind add_continuous_aggregate_policy. The headline TSL feature is the Hypercore columnstore and native compression โ the very thing you would most want for a multi-year historian, and the thing PI does brilliantly โ but the automation we used above is on the same side of the line. None of this costs money; the TSL is free-to-use and source-available. It simply is not OSI open source, and pretending otherwise is the exact licensing overstatement this book exists to avoid.
So this chapter takes the pragmatic honest-hybrid path: we run the TSL Community automation (continuous aggregates and add_retention_policy) because it is free and excellent, while staying off the one TSL feature โ Hypercore compression โ whose absence costs us only disk. The comment block at the top of examples/platform/db/20-historian.sql names that boundary exactly:
-- Apache-2.0 core (hypertables, create_hypertable, time_bucket, drop_chunks) plus
-- free TimescaleDB Community (TSL) automation: continuous aggregates and
-- add_retention_policy. TSL is free-to-use and source-available, but NOT OSI
-- open source. We deliberately do NOT use the TSL Hypercore columnstore/compression,
-- so a strictly Apache-2.0 build is one cron-driven drop_chunks away โ see Chapter 13.
That is the trade-off in one paragraph. The companion stack pins the standard timescale/timescaledb:2.17.2-pg17 image, which bundles the free TSL Community features, so the continuous aggregates and add_retention_policy in this file run as written. If you must instead be strictly Apache-2.0 โ for example to redistribute the stack without any source-available component โ switch to the Apache-only -oss build, which does not even expose the TSL functions, and replace the continuous aggregates with plain materialized views refreshed by cron and add_retention_policy with a scheduled drop_chunks. The historian is then under a true open-source licence, at the cost of that convenience. Conversely, if your organisation is comfortable with TSL terms, turning on Hypercore is a one-line change and a large storage win. Each of these is a licensing decision, and we make it visible rather than smuggling it in.
The strictly Apache-2.0 alternatives exist precisely for teams that will not accept source-available terms at all. Apache IoTDB is cleanly Apache-2.0 and device-native: it models each series as a (device, measurement, timestamp, value) path and ships its own columnar TsFile format, a natural fit when you think in terms of equipment trees rather than SQL tables [6]. InfluxDB 3 Core โ the open-source tier, MIT/Apache-2.0, rebuilt on Apache Arrow, DataFusion, and Parquet โ is permissive, though the Enterprise and Cloud tiers are not, which is why a careless influxdb:latest pull is a known trap [7]. QuestDB is Apache-2.0 with purpose-built SQL time-series operators like SAMPLE BY, LATEST ON, and ASOF JOIN that make time-aligned queries terse [8]. We ship TimescaleDB as the default because the join-to-the-batch-model story is so much cleaner inside one Postgres โ and we are explicit that the default uses free TSL Community automation, while a reader who needs strictly Apache-2.0 everywhere has a real, named path.
Swinging-door compression: power and perilโ
The feature TSL gates โ and the technique every commercial historian leans on โ deserves its own explanation, because it is where "save space" can quietly become "change the record." The classic algorithm is swinging-door trending, patented by Bristol in 1987 [9]. The intuition: instead of storing every point on a slowly changing signal, store a point only when the line can no longer be drawn through the readings within a tolerance band โ a "deviation" or deadband. A flat temperature trace holding at 37.0 ยฐC for an hour collapses from 3,600 points to a handful, reconstructed by linear interpolation.
That tolerance parameter is a knife that cuts both ways. It is the single dial that governs the trade-off between storage reduction and reconstruction error [10]. Set it loose and you get spectacular compression โ and you can smooth away the very excursion you are obligated to detect. Studies of swinging-door on high-rate sensor streams quantify exactly this: an over-aggressive tolerance introduces real reconstruction error that distorts the recorded signal [11]. For a GMP record that is not merely a quality concern; it is the Accurate in ALCOA+ at stake. Our day-7 excursion peaks only ~0.5 ยฐC below setpoint โ well inside a sloppily configured deadband. The lesson the chapter wants you to leave with: lossy compression is legitimate and ubiquitous, but the deadband is a validated parameter, not a storage-saving afterthought, and the safest open-source posture is to store the raw record and downsample with explainable rollups (our continuous aggregates) rather than reconstructing from a lossy original.
Why it mattersโ
The historian is the floor the whole platform stands on. Get its shape wrong and every chapter above it inherits the mistake. Three decisions in this one short DDL file carry the most weight. The long-narrow schema means adding sensors never costs a migration. Keeping the historian inside PostgreSQL means contextualization (Chapter 14) is a join, not an integration project. And carrying the OPC UA quality flag as a first-class column means a reading's trustworthiness travels with it into the permanent record โ which is the technical precondition for ALCOA+ data integrity and for any audit-trail review that follows. The pieces we deliberately left out โ TSL compression, swinging-door lossy reduction โ matter just as much, because choosing not to use them is what keeps the stack cleanly open and the recorded signal faithful.
In the real worldโ
For thirty years the answer to "where do plant signals live?" has been OSIsoft PI, now AVEVA PI โ a mature, validated, vendor-supported historian with native compression, a per-point quality model, and an asset framework, deployed in nearly every large biomanufacturer. An open-source historian gets you genuinely far: TimescaleDB or IoTDB will ingest your tags, roll them up, retain them, and serve them at no licence cost, and they do it well. The honest gaps are specific and worth stating without flinching. You do not get PI's patented compression for free (it is TSL-gated or absent). You do not get a built-in quality flag (you build the column, as we did). And you do not get a vendor's validated-system package, support contract, or supplier accountability that a GAMP-5 assessment leans on โ that burden becomes yours, and Chapters 17 and 22 take it seriously, including a real bidirectional bridge to PI for the very common case where PI stays the system of record and the OSS stack is the analytics layer beside it. That hybrid is not a failure of open source; it is the realistic shape of a regulated plant.
NIIMBL โ the U.S. public-private Institute for innovation in biopharmaceutical manufacturing โ is building exactly the kind of multi-vendor, sensor-dense line where this question bites. Its SABRE facility (the NIIMBL / University of Delaware pilot-scale cGMP โ current Good Manufacturing Practice โ plant that broke ground in April 2024) will generate streams from many skids that all have to land somewhere queryable and retainable. SABRE is a facility under construction, not a data standard or a historian product, but it is precisely the setting that makes the "one trustworthy time-series store" problem in this chapter concrete rather than academic.
Key termsโ
- Historian โ a database specialized for storing and serving high-rate, timestamped process signals; the open-source analogue of AVEVA/OSIsoft PI.
- Hypertable โ a TimescaleDB table that behaves like one table but is auto-partitioned into time-ranged chunks, so queries scan only the relevant time windows.
- Chunk โ one time-range partition of a hypertable (here, one day); the unit that retention drops and that the planner prunes.
- Continuous aggregate โ a materialized view over a hypertable that refreshes incrementally as data arrives, used to pre-roll
avg/min/max/lastsummaries. - Retention policy โ a scheduled rule that drops chunks older than a set interval, expressed declaratively rather than as row deletes.
- Quality flag โ a per-reading code (OPC UA: 192 Good, 64 Uncertain, 0 Bad) recording how trustworthy the value is; first-class here, absent by default in OSS historians.
- Long (narrow) schema โ one row per reading (
ts, tag, value, โฆ) rather than one column per sensor; lets a new tag be data, not a migration. - TSL (Timescale License) โ the source-available licence governing TimescaleDB's
tsl/features (Hypercore columnstore, native compression); not OSI open source, and deliberately unused here. - Swinging-door trending โ the classic lossy historian compression algorithm; its deviation/deadband tolerance trades storage against reconstruction error and must be treated as a validated parameter.
Where this leadsโ
The historian now holds a faithful, retainable, quality-tagged river of readings โ but every one of those readings is still mute about which step of which batch it belongs to. In Chapter 14 โ Contextualization: Joining Time-Series to the Batch, we marry this ts.sensor_reading hypertable to the ISA-88/95 batch model from Chapter 3 with a single temporal-join view, so a reading stops being "37.04 ยฐC at some instant" and becomes "37.04 ยฐC during the Production phase of BATCH-2026-001 on BR101" โ the moment raw data turns into process knowledge.