The Plant Information Systems: Historian, MES, LIMS, and ELN

📍 Where we are: We climb one floor above the automation layer to meet the four information systems — historian, MES, LIMS, and ELN — that each own a slice of the plant's data.

In the previous chapter, Automation and Process Control Data, we settled in on the machine-room floor: the PLCs (programmable logic controllers), DCS (distributed control system), and SCADA (supervisory control and data acquisition) systems that run the process and stream out setpoints (the target values the controllers steer toward), alarms, events, and recipes, all structured by the ISA-88 batch-control standard that defines how a recipe is shaped. Those systems are brilliant at acting in the moment. But a single second's worth of sensor readings is useless unless something catches it, stores it, gives it meaning, and keeps it for years. That "something" is not one system but a constellation of them, sitting on the floor above the controllers, each speaking its own dialect.

This chapter introduces the four members of that constellation: the process historian (time-series data), the MES (batch execution), the LIMS and ELN (the laboratory's world), plus a few important relatives. Each owns a different slice of the truth. And the gaps between them — the integration seams — are where this entire book finds its central problem.

The simple version

Imagine a hospital recording one patient. A bedside heart monitor scribbles a continuous trace every second (that's the historian). A nurse fills in the official treatment chart, signing off each step against the doctor's orders (that's the MES). The lab files blood-test results in its own system (that's the LIMS). And a researcher keeps a notebook of experiments tried along the way (that's the ELN). All four describe the same patient — but they don't automatically talk to each other, and they each call the patient something different.

What this chapter covers

We'll meet each system in turn — what it is, what data it owns, and why it exists separately — then step back to see the integration seams and the silos that form along them.

[Figure: One physical plant, many information systems. The data's value lives in the seams between them. The diagram shows one physical plant feeding several separate information systems (historian, MES, LIMS, ELN), with the integration seams highlighted between them. Original diagram by the authors, created with AI assistance.]

The four systems, one at a time

We'll meet each member of the constellation in turn — what it owns, why it exists separately, and what kind of identifier it stamps on the batch as the lot moves through it. The same physical batch is born in the production bioreactor we met in the manufacturing book; here we watch the data about that batch split into four separate accounts.

The historian: time-series identity and tag nomenclature

A process historian is a database built for one job: storing time-series data — long streams of measurements, each stamped with the exact time it was taken. Every measured point in the plant — a temperature probe, a pH electrode, a flow meter — is a tag, a named channel that produces a value many times a minute. Tags follow a structured naming convention so a human (and a machine) can decode them at a glance: a tag like BR-201.Temp.PV reads as bioreactor 201, temperature, process value, while BR-201.Temp.SP is the same probe's setpoint. A mid-sized biomanufacturing line can hold tens of thousands of tags, the fastest of which produce a value every second or more often. Over a month-long production campaign, this accumulates billions of data points — around 26 billion in a single 30-day run if 10,000 tags each sample once per second.
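The 26-billion figure follows from simple multiplication. A quick check, using only the numbers quoted above (the variable names are ours):

```python
# Back-of-envelope count of raw historian points for one campaign.
TAGS = 10_000                    # tags sampled, as in the example above
SAMPLES_PER_SECOND = 1           # one value per tag per second
SECONDS_PER_DAY = 24 * 60 * 60   # 86,400
CAMPAIGN_DAYS = 30

raw_points = TAGS * SAMPLES_PER_SECOND * SECONDS_PER_DAY * CAMPAIGN_DAYS
print(f"{raw_points:,} raw points")  # 25,920,000,000 raw points
```

That is the uncompressed count; what actually lands on disk depends on the compression settings discussed next.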

Each stored point is more than a number. A single historian row pairs the timestamp, the tag, the value, and a quality flag that says whether the instrument was healthy when it reported:

2024-06-13T14:03:07.123Z,BR-201.Temp.PV,36.8,GOOD
2024-06-13T14:03:08.123Z,BR-201.Temp.PV,36.8,GOOD
2024-06-13T14:03:09.123Z,BR-201.pH.PV,7.02,GOOD
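To make the row structure concrete, here is a minimal sketch that parses those three rows into typed records and splits a tag into its parts. The `HistorianPoint` class and the `parse_tag` and `parse_rows` helpers are illustrative names, not any historian's API, and the three-part tag layout is assumed from the BR-201.Temp.PV example.

```python
import csv
from dataclasses import dataclass
from datetime import datetime
from io import StringIO


@dataclass(frozen=True)
class HistorianPoint:
    timestamp: datetime   # when the value was measured (UTC)
    tag: str              # named channel, e.g. "BR-201.Temp.PV"
    value: float
    quality: str          # instrument health flag, e.g. "GOOD"


def parse_tag(tag: str) -> dict:
    """Split a tag like 'BR-201.Temp.PV' into unit, measurement, and role."""
    unit, measurement, role = tag.split(".")
    return {"unit": unit, "measurement": measurement, "role": role}


def parse_rows(text: str) -> list:
    points = []
    for ts, tag, value, quality in csv.reader(StringIO(text)):
        # Swap the trailing 'Z' for an explicit offset so that
        # datetime.fromisoformat accepts it on older Python versions too.
        stamp = datetime.fromisoformat(ts.replace("Z", "+00:00"))
        points.append(HistorianPoint(stamp, tag, float(value), quality))
    return points


SAMPLE = """2024-06-13T14:03:07.123Z,BR-201.Temp.PV,36.8,GOOD
2024-06-13T14:03:08.123Z,BR-201.Temp.PV,36.8,GOOD
2024-06-13T14:03:09.123Z,BR-201.pH.PV,7.02,GOOD
"""
points = parse_rows(SAMPLE)
```

Keeping the quality flag on every record, rather than filtering bad points out at ingestion, leaves the decision about what to trust to the reader of the data.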

Why not just use an ordinary relational database (the spreadsheet-like tables most business software runs on)? Because a general-purpose relational database that has not been tuned for time series buckles under that firehose of writes, and storing every raw point would be ruinously expensive. Commercial historians like OSIsoft PI (now AVEVA PI System), among the most widely deployed historians in biopharma, along with GE Proficy Historian and Honeywell PHD, are purpose-built to ingest tens of thousands of tags and answer queries across years of them. The same job can now be done with open-source time-series databases (TSDBs) such as TimescaleDB, InfluxDB, QuestDB, and Apache IoTDB, which is exactly the route the companion Open-Source Bioprocess Data Systems book takes, building a working historian on a TimescaleDB hypertable instead of a proprietary one. Historians solve the volume problem with compression: classically the swinging-door (and related deadband) algorithms, which keep only the points needed to reconstruct the signal within a defined tolerance and discard the redundant ones in between. A deadband is that tolerance band: a margin of allowed wobble around the expected value, within which a new reading is treated as "no real change" and not stored. Swinging-door compression, for instance, discards points that fall within the deadband around the projected trend line. The result is enormous storage savings while the shape of the curve is preserved to within that tolerance.

That trade-off is also a data-integrity question. Suppose a pH probe reads 7.00, 7.02, 7.01, 7.03, 7.00, 7.02 over 30 seconds, a flutter that never strays more than 0.03 from its starting value. A swinging-door algorithm with a 0.05 pH deadband would keep only the first and last point and discard the four in between, and you would lose nothing that matters. But widen that deadband to 0.15, and a real excursion from 7.00 up to 7.15 that lasts only 10 seconds can vanish entirely, and with it your ability to prove you caught the deviation. Because over-aggressive compression erases silently, regulators expect the original record and its meaning to survive. The FDA's data-integrity guidance and the harmonized PIC/S guidance both insist that captured process data remain complete, attributable, and reconstructable across its lifecycle, historian compression settings included [5][8].
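The two scenarios above can be reproduced with a simplified sketch of the textbook swinging-door algorithm. This is not any vendor's implementation, and commercial historians layer further settings on top of it. The function name, the 6-second spacing of the flutter readings, and the shape of the excursion series are assumptions made for illustration.

```python
def swinging_door(points, deviation):
    """Return the (time, value) points kept by swinging-door compression.

    `points` must be sorted by strictly increasing time; `deviation` is the
    deadband half-width in the signal's own units.
    """
    if len(points) <= 2:
        return list(points)
    kept = [points[0]]
    t0, v0 = points[0]                      # last archived point (the pivot)
    up, low = float("-inf"), float("inf")   # slopes of the two "doors"
    prev = points[0]
    for t, v in points[1:]:
        # Each door is hinged `deviation` above or below the pivot and
        # can only swing open as new points arrive.
        up = max(up, (v - (v0 + deviation)) / (t - t0))
        low = min(low, (v - (v0 - deviation)) / (t - t0))
        if up > low:
            # The doors have opened past parallel: no single line from the
            # pivot stays within tolerance, so archive the previous point
            # and restart the corridor from it.
            kept.append(prev)
            t0, v0 = prev
            up = (v - (v0 + deviation)) / (t - t0)
            low = (v - (v0 - deviation)) / (t - t0)
        prev = (t, v)
    kept.append(points[-1])                 # always keep the newest point
    return kept


# Six flutter readings spread over 30 seconds (spacing assumed).
flutter = list(zip(range(0, 36, 6), [7.00, 7.02, 7.01, 7.03, 7.00, 7.02]))
flutter_kept = swinging_door(flutter, 0.05)

# One reading per second; pH sits at 7.15 for 10 seconds, else 7.00.
excursion = [(t, 7.15 if 20 <= t < 30 else 7.00) for t in range(61)]
tight_kept = swinging_door(excursion, 0.05)
wide_kept = swinging_door(excursion, 0.15)
```

With the 0.05 deadband the flutter collapses to its first and last point, and the excursion survives in the archive. With the 0.15 deadband the excursion series also collapses to two points, both at 7.00, so the archived record shows a flat line.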

Note: A historian answers "what was the temperature at 14:03:07 on Tuesday?" in milliseconds across years of data. A general-purpose relational database would struggle to even hold the question. Different jobs, different tools.

The same tag-and-deadband story plays out just as sharply downstream, in purification, and it is worth one concrete example because the historian is too easily pictured as an upstream-only, bioreactor-temperature affair. Take Protein A capture — the affinity-chromatography step that fishes the antibody out of the clarified harvest (the capture chromatography chapter walks it in full). Its skid streams its own tags: a UV280 absorbance trace (PAC-01.UV280.PV, in milli-absorbance units), conductivity, pH, and flow, sampled every second through the load, wash, low-pH elute, and clean-in-place cycle. The historian's deadband is doing real work on every one of them — but here the integrity stakes are higher than on a slow temperature probe, because the shape of the UV280 elution peak is the record that justifies the pooling-window cut points (the two thresholds, set against the live UV trace, between which the eluted antibody is collected as product and outside which it goes to waste). Compress that peak too aggressively and you can erase the breakthrough shoulder that signals product escaping unbound, or blur the peak's leading and trailing edges that the cut points are defined against — losing exactly the evidence a reviewer needs to prove the right slice was kept. The low-pH elution then flows straight into viral inactivation (a held acid step that kills enveloped viruses), whose hold pH and time are themselves historian tags a regulator will ask to reconstruct minute by minute. Downstream is not an afterthought for the historian; it is where compression settings most directly touch product quality.

The MES batch record: the official as-executed sequence

If the historian watches, the MES — Manufacturing Execution System — governs. Sitting between the control floor and the business systems above, the MES manages how a batch is actually made: it dispatches work instructions, enforces the approved recipe step by step, and refuses to let an operator skip ahead or use the wrong material. Commercial MES platforms built to enforce recipes this way include AVEVA Wonderware, Siemens Opcenter Execution, and Körber's Werum PAS-X (a pharma-specific MES).

Its signature output is the EBR (electronic batch record), the digital replacement for the old paper binder that documented every action in making a batch. The MES enforces the master recipe (the approved template for how the product is made) and produces, for each lot, a complete signed account of what was done, by whom, and when. This makes the MES the system of record for batch execution: the single authoritative source for "how this batch was manufactured."

The recipe's structure is standardized: ISA-88, which the previous chapter introduced, defines the procedural model of unit procedures, operations, and phases that the master recipe is built from. What is not standardized is the biopharma-specific content of that recipe and its batch record. ISA-88 and its XML exchange schemas tell you how to shape a recipe and move batch data between systems (BatchML is the XML implementation of ISA-88, distributed alongside B2MML, the Business To Manufacturing Markup Language that implements ISA-95); they do not tell you which parameters a CHO (Chinese hamster ovary) mammalian-cell culture, a perfusion bioreactor, or a protein-A capture step must record, nor what to call them. There is no widely adopted companion specification that fills that biopharma-specific gap, so each MES vendor models a cell-culture batch its own way. That missing layer is one root cause of the silos this chapter is about: even two plants running ISA-88-compliant systems can disagree on how a CHO batch is described, which is precisely the problem that motivates the shared vocabularies and ontologies the later chapters reach for.

Because that record is legally binding, it must satisfy the U.S. FDA's 21 CFR Part 11, the regulation governing electronic records and electronic signatures. Part 11 does not stand alone: it rides on a predicate rule — the underlying GMP (Good Manufacturing Practice) regulation that says the record must exist in the first place. For a batch record that predicate rule is 21 CFR Part 211, the current Good Manufacturing Practice (cGMP) for finished pharmaceuticals, which requires a complete master production record (§211.186) and a contemporaneous batch production record documenting every significant step, by whom and when (§211.188) [12]. Part 11 then adds the electronic layer on top: secure, computer-generated, time-stamped audit trails, controls over who can do what, and signatures bound to records so they cannot be transplanted or repudiated [4]. In EU-approved facilities the comparable rulebook is EU Annex 11 (EudraLex Volume 4, Good Manufacturing Practice, Annex 11: Computerised Systems), which lays down parallel requirements for validation, audit trails, and access controls. The long-standing 2011 text is now being modernized: a revised Annex 11 went out for public consultation in 2025 — extending its reach to audit-trail review, cloud and software-as-a-service providers, data integrity across networked multi-system environments, and AI/ML, and formalizing the ALCOA+ data-integrity principles (defined below) — with finalization expected in 2026 [11].

Review by exception: how the EBR triages itself

The EBR's quietest superpower is review by exception. A paper batch record forces a quality reviewer to read every line of every step to confirm the batch was made correctly; for a biologics lot that can be hundreds of pages. A structured ISA-88 EBR turns most of that reading into arithmetic the computer does itself.

The mechanism is parametric. Each recipe step carries its acceptance criteria as data — an inoculation (seeding the culture with cells) temperature that must stay within 36.5–37.5 °C, a hold time (how long a step is held before proceeding) of at least 30 minutes, a weight checked against a target. As the operator executes the step, the MES compares the captured value against those limits in real time and stamps the step in-tolerance or out-of-tolerance. A step that passes is auto-confirmed: it needs no human reading, only the standing approval baked into the master recipe. A step that fails is raised as a flagged exception — and now a human must look.
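The parametric check can be sketched in a few lines. The limits below are the ones quoted in the paragraph above; the `Limit` class, the parameter names, and `evaluate_step` are hypothetical and do not reflect any MES product's data model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Limit:
    """Acceptance criterion carried as data; None means unbounded on that side."""
    low: Optional[float] = None
    high: Optional[float] = None

    def accepts(self, value: float) -> bool:
        above_low = self.low is None or value >= self.low
        below_high = self.high is None or value <= self.high
        return above_low and below_high


# Acceptance criteria for one recipe step (names are illustrative).
STEP_LIMITS = {
    "inoculation_temp_c": Limit(low=36.5, high=37.5),
    "hold_time_min": Limit(low=30.0),
}


def evaluate_step(captured: dict) -> dict:
    """Stamp each captured value in-tolerance or out-of-tolerance."""
    return {
        name: "in-tolerance" if STEP_LIMITS[name].accepts(value) else "out-of-tolerance"
        for name, value in captured.items()
    }


clean = evaluate_step({"inoculation_temp_c": 37.1, "hold_time_min": 32.0})
flagged = evaluate_step({"inoculation_temp_c": 38.2, "hold_time_min": 32.0})
```

A step whose values all come back in-tolerance can be auto-confirmed; any out-of-tolerance entry is what gets raised to a human.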

What the human does next depends on the kind of exception. If the deviation was anticipated and a procedure already governs it — say, a documented allowance to extend a hold by a defined margin — it is a managed (planned) deviation: the EBR records the justification against the step and the batch proceeds under control. If it was not anticipated — a temperature excursion outside any pre-approved range — it is an unplanned deviation, and the flagged step is linked to a formal deviation record that opens an investigation and, where warranted, a CAPA (corrective and preventive action). That link is the payoff of a structured EBR: the offending step, its captured values, its audit trail, and the deviation/CAPA record are bound to the same batch, so an investigator can pull the whole thread at once instead of reconstructing it from paper.

Review by exception is also what gates batch disposition. A lot cannot be released while any flagged exception is still open: every unplanned deviation must be closed (investigated, justified, and its CAPA resolved) before the Quality Unit signs the disposition decision (release, reject, or hold). The Quality Unit is the plant's independent quality organization, responsible under GMP for releasing or rejecting each lot. The reviewer's attention is spent only on the handful of steps the system could not clear on its own, which is exactly why a structured EBR is faster and more trustworthy than the paper binder it replaced: the routine is confirmed by the machine, and the human judges only the exceptions.
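The disposition gate reduces to one rule: no signature while anything is open. A minimal sketch, with hypothetical class and field names:

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    OPEN = "open"
    CLOSED = "closed"


@dataclass(frozen=True)
class FlaggedException:
    step: str        # the recipe step that was flagged
    kind: str        # "planned" or "unplanned"
    status: Status


def open_exceptions(exceptions):
    """Return the exceptions that still block disposition."""
    return [e for e in exceptions if e.status is Status.OPEN]


def can_sign_disposition(exceptions) -> bool:
    """The Quality Unit may sign only when nothing is left open."""
    return not open_exceptions(exceptions)


batch_exceptions = [
    FlaggedException("hold-extension", "planned", Status.CLOSED),
    FlaggedException("temperature-excursion", "unplanned", Status.OPEN),
]
blocked = not can_sign_disposition(batch_exceptions)
```

The function says nothing about which decision is signed. Release, reject, and hold remain human judgments; the gate only decides whether the question may be asked yet.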

LIMS and ELN: when a batch becomes many samples

Manufacturing makes the product; the laboratory decides whether it's good enough to release. Two systems own that world. This is the same lab world the manufacturing book reaches at QC and release — the point where a batch is judged fit (or not) to ship.

The LIMS (Laboratory Information Management System) tracks samples and results. When a vial is pulled from a bioreactor, the LIMS assigns it an identity, routes it to the right tests, records who ran each test on which instrument, holds the specifications, and judges each result against them. The specifications are the pass/fail limits a result must meet; the results themselves are measured with analytical procedures that are developed under ICH (International Council for Harmonisation) Q14 [10] and validated under ICH Q2(R2) [13], the harmonized companion guidelines for, respectively, designing an analytical method and proving it fit for purpose. LIMS vendors such as LabVantage, Thermo Fisher SampleManager, and STARLIMS specialize in exactly this sample-and-result tracking. The LIMS is the system of record for quality-control (QC) data: the structured, regulated answer to "did this batch meet spec?"

The ELN — Electronic Lab Notebook — is the digital descendant of the bound paper notebook. Where the LIMS handles routine, structured testing, the ELN captures the exploratory, narrative work: the experiments a scientist designs, the conditions tried, the reasoning, the unexpected result worth chasing. ELN platforms such as IDBS E-WorkBook, Labguru, and Benchling let scientists record experiments in this freer form. The boundary blurs in practice — many vendors bundle both — but the distinction matters: LIMS answers "is this sample within spec?", while the ELN answers "what did we try, and why?"

Both fall squarely under the data-integrity expectations summarized by the ALCOA+ principles: data should be Attributable, Legible, Contemporaneous, Original, Accurate, and (the "+") Complete, Consistent, Enduring, and Available. These principles apply to any GxP record regardless of which system holds it. (GxP is the umbrella for the "good practice" quality regulations, meaning Good Manufacturing, Laboratory, and Clinical Practice and their kin, that govern any record a regulator may inspect.) The FDA's data-integrity Q&A and PIC/S PI 041 apply these expectations to laboratory and chromatography systems just as firmly as to the shop floor [5][8].

[Figure: A controlled laboratory workspace (solutions and equipment arranged inside a laminar-flow cabinet). The physical plant is mirrored by a constellation of information systems, each owning a different slice of the data: continuous sensor streams (the historian), step-by-step batch instructions (the MES), offline analytical results (the LIMS), and separately kept experiment notes (the ELN). Laminar-flow cabinet. Image by syed sajidul islam, licensed under CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/), via Wikimedia Commons; used unmodified.]

The supporting cast

Four systems don't tell the whole story. Around them sit several relatives:

SCADA archives — the supervisory-control layer often keeps its own short-term store of operational history, separate from the long-term plant historian.
BMS / EMS — the Building Management System and Environmental Monitoring System watch the room, not the process: cleanroom temperature, humidity, differential pressure, and airborne-particle and microbial counts. In a sterile facility these records are part of the release decision.
CDS — the Chromatography Data System acquires and processes the signals from analytical chromatography instruments; widely used platforms include Waters Empower, Agilent OpenLab CDS (successor to its ChemStation platform), and Shimadzu LabSolutions. The CDS category as a whole sits in a specialized, heavily regulated data world that regulators have long treated as a data-integrity focus area, which is why data-integrity guidance singles out chromatography systems for particular attention [5].
ERP — the Enterprise Resource Planning system, up at the business level, owns materials, inventory, and orders, and exchanges information with the MES at the boundary between operations and enterprise.

Each of these is, in the language of computerized-system validation, a GxP-relevant system that must be assessed and validated for fitness — the risk-based discipline laid out in GAMP 5, the industry's standard playbook for assuring such systems [6].

How the pieces fit — and where they don't

Here is the constellation in one view, anchored to the layered hierarchy that the ANSI/ISA-95 standard for plant-system layers (who owns what) defines between sensors and the enterprise [1].

Each box owns a different slice of the same batch. The single-headed arrows are one-way sources (a PLC feeds the historian); the double-headed arrows are bidirectional flows where data must be reconciled across a boundary (MES ↔ ERP). Those seams — where the arrows meet — are exactly where silos form when the reconciliation breaks down. Original diagram by the authors, created with AI assistance.

The problem this book keeps circling lives in those seams. Each system was built by a different vendor, for a different purpose, with its own internal vocabulary. The same manufacturing batch appears in the historian as a tag prefix (e.g., BR201_BATCH0156), in the MES batch record as a formal batch number (BATCH_2024_0156), in the LIMS as the parent of many test samples (S-0156-001, S-0156-002, and so on), and in the ERP as an inventory lot (LOT-22A-MABX) — each with its own identifier scheme. Worse, the LIMS entries are themselves subdivisions: the batch is one thing, but the samples drawn from it are many, so tracing a single sample result back to the original batch conditions means bridging three or four systems. The same real-world batch wears several different names, and nothing automatically knows they belong together.
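The gap can be felt in a few lines. The sketch below tries to recover a common serial number from each local name with hand-written patterns. The patterns are hypothetical, inferred only from the four example identifiers above, and are exactly the kind of brittle mapping code a plant ends up maintaining.

```python
import re

# The four example identifiers from the paragraph above.
LOCAL_NAMES = {
    "historian": "BR201_BATCH0156",
    "mes": "BATCH_2024_0156",
    "lims": "S-0156-001",
    "erp": "LOT-22A-MABX",
}

# Hypothetical per-system patterns that pull out a four-digit batch serial.
# No pattern exists for the ERP lot: nothing in its name encodes the serial.
PATTERNS = {
    "historian": re.compile(r"_BATCH(\d{4})$"),
    "mes": re.compile(r"^BATCH_\d{4}_(\d{4})$"),
    "lims": re.compile(r"^S-(\d{4})-\d{3}$"),
}


def guess_serial(system: str, name: str):
    """Return the embedded serial if a pattern matches, else None."""
    pattern = PATTERNS.get(system)
    match = pattern.search(name) if pattern else None
    return match.group(1) if match else None


serials = {system: guess_serial(system, name) for system, name in LOCAL_NAMES.items()}
print(serials)
```

Three names happen to embed 0156 and line up. The ERP lot carries no trace of it, so no amount of pattern matching bridges that seam; only an explicitly stored link can. Even the three matches hold only as long as every system keeps its naming habit unchanged.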

Anatomy of one batch's identity

To see the silo concretely, dissect a single lot the way the previous chapter dissected an OPC UA node. One physical mAb (monoclonal antibody) batch leaves five different fingerprints, one per system of record: the four we just met plus the ERP lot up at the business level. The field that should tie them together, a shared batch_id, does not exist in any of them.

The same lot wears five names. The missing piece is not data — it is the shared key that would let a machine know these five identities describe one thing. Original diagram by the authors, created with AI assistance.

Notice what the card is not showing: there is no column for "the canonical batch identifier," because no system holds one. The link between BR201_BATCH0156 and S-0156-002 lives only in the memory of the analyst who pulled the sample and typed the number. That is the silo, made specific.

The identifier gap: five names for one thing

Reading the card row by row tells the same story across time. Follow the lot down the four manufacturing-and-lab timelines below (historian, MES, LIMS, ELN) before it ever reaches the ERP, and every crossing from one system to the next is stitched by a human, whether an operator transcribing a number or an analyst pasting a prefix into a search box, never by an automatic key.

The peak-titer instant — the moment the antibody concentration (the titer) peaked — that the historian recorded, the production phase the MES logged, the sample the LIMS pulled, and the method the ELN described are all the same moment in the same batch — yet a query engine cannot know that without a human first asserting it. This is the classic data silo: rich data, trapped in incompatible boxes. The peer-reviewed literature on digital twins in biopharma puts it bluntly — the field's data sits in disconnected sources, and an integrated data layer is the prerequisite for breaking the silos [9]. It is also exactly what the FAIR principles — that data should be Findable, Accessible, Interoperable, and Reusable — were written to fix [2].

The shared key these five names are missing is the thread that the first part of this book named batch genealogy — the lineage that ties a finished vial back to every run, sample, method, and consumed material lot that shaped it (the data shadow chapter built this thread over the companion guide's s88.batch table and its lot_genealogy edges). Genealogy is precisely what spans these systems: it crosses from the ERP's consumed-material lots, through the MES batch record, into the LIMS sample tree. Where that genealogy is asserted by hand rather than carried as data, the silo is not just an inconvenience — it is the gap a regulator probes when asking you to trace a result back to its origin.

Why the seams matter: the integration-cost question

For data management, the lesson is that there is no single "plant database." There is a federation of systems of record, each authoritative for its own slice, and value is created — or lost — at the boundaries between them. A question as simple as "which culture conditions gave the highest-purity batch?" requires joining the historian (conditions), the MES (which batch), and the LIMS (purity) — three systems, three vocabularies, three identifier schemes. If the seams aren't bridged, that question simply can't be answered, no matter how much data was collected.

The cost is not abstract. Every cross-system question becomes a small integration project: someone writes brittle mapping code, or — more often — exports spreadsheets and reconciles identifiers by hand. The bill comes due twice: once when the analysis is slow and expensive, and again at inspection, when a reviewer asks you to trace a result back to its manufacturing conditions and the chain of custody runs through a human's recollection rather than a verifiable link. The decision of who owns which slice — and how those owners hand the batch across each boundary — is exactly the layered question the next chapter formalizes through the ISA-95 architecture, and that strong data governance has to assign and enforce.

Why the seams matter for machine learning: the batch is the unit of trust

The silo is not only an analytics tax; it is a hidden reason many plant machine-learning models report a score they will never reproduce in production, and the connection runs straight through the missing batch_id. A soft sensor that predicts titer from the historian's spectra is trained on rows and tested on rows, but the honest unit of an independent observation in bioprocess is not the row; it is the batch. If the same batch's rows land on both sides of the train/test split, the model has effectively seen the answer, and its score is inflated by data leakage, the cardinal sin where information from the evaluation set bleeds into training. The discipline that prevents it is grouped cross-validation, of which leave-one-batch-out is a common form: hold whole batches out, never individual rows, so the score measures generalization to an unseen batch the way production will. And you simply cannot do it without a trustworthy shared key, because grouping by batch requires every historian row, every LIMS sample, and every MES phase to agree on which batch they belong to. The silo this chapter describes is therefore an upstream cause of one of the most common and most damaging mistakes in bioprocess ML; the ML book's Data, the Fuel and Models and Validation chapters make the batch-grouped split the harness every claim must survive.
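A batch-grouped split needs nothing more than a batch label on every row. The sketch below is a dependency-free illustration of leave-one-batch-out; the function name and the toy batch labels are ours, and in practice a library splitter would do the same job.

```python
from collections import defaultdict


def leave_one_batch_out(batch_ids):
    """Yield (held_out_batch, train_indices, test_indices), one fold per batch.

    `batch_ids[i]` is the batch that row i belongs to. Every row of the
    held-out batch goes to the test side, so no batch straddles the split.
    """
    rows_by_batch = defaultdict(list)
    for index, batch in enumerate(batch_ids):
        rows_by_batch[batch].append(index)
    for held_out, test_idx in rows_by_batch.items():
        train_idx = [
            i
            for batch, indices in rows_by_batch.items()
            if batch != held_out
            for i in indices
        ]
        yield held_out, train_idx, test_idx


# Six rows drawn from three batches (toy labels).
row_batches = ["B1", "B1", "B2", "B2", "B2", "B3"]
folds = list(leave_one_batch_out(row_batches))
```

The whole construction rests on `row_batches` being right. If a historian row and a LIMS row from the same lot arrive with different labels, the split leaks and nothing in the code can detect it.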

The same missing key bites again after deployment. A locked production model needs two guards the silo undermines. An applicability-domain check (a test of whether a new input lies inside the envelope of conditions the model was trained on, so an out-of-range spectrum is flagged rather than silently extrapolated) needs the contextualized history to define the envelope in the first place. And drift detection — watching for the model's inputs or errors to wander as the living process and its instruments slowly move — must be able to distinguish a genuine process drift (a real biological shift worth investigating) from a mere data-pipeline break at a seam (a renamed tag, a LIMS row filed under a second identifier), and only an integrated layer that ties every signal back to one batch identity can tell those two apart. This is why the digital-twins literature insists the integrated data layer — not more sensors, not a fancier model — is the prerequisite for breaking the silos [9]: a model is only as trustworthy as the join beneath it. The same locked-model, change-controlled discipline that the MLOps and lifecycle chapter builds, and that the CSV-to-CSA validation chapter frames for any computerized system, rests on the batch identity this chapter shows is so often missing.
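An applicability-domain check can be as simple as a per-feature range test. The sketch below uses that crudest form, a bounding box over the training rows; production systems typically use distance-based measures instead. The feature order (temperature, pH) and the values are invented for illustration.

```python
def fit_envelope(training_rows):
    """Per-feature (min, max) over the rows the model was trained on."""
    columns = list(zip(*training_rows))
    return [(min(column), max(column)) for column in columns]


def in_domain(envelope, row) -> bool:
    """True if every feature of `row` lies inside the training envelope."""
    return all(low <= x <= high for (low, high), x in zip(envelope, row))


# Toy training history: (temperature in degrees C, pH) per row.
training_rows = [(36.6, 6.95), (36.9, 7.02), (37.2, 7.10)]
envelope = fit_envelope(training_rows)

inside = in_domain(envelope, (37.0, 7.00))     # within both ranges
outside = in_domain(envelope, (38.4, 7.00))    # temperature beyond the envelope
```

The envelope is only meaningful if `training_rows` really are the contextualized rows from the batches the model saw, which again depends on the join beneath it.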

Standards that could fix it: semantics and batch reconciliation

The industry's response has two layers. Technically, the move is toward OPC UA (Open Platform Communications Unified Architecture), a vendor-neutral standard that carries not just values but their meaning — semantics — across the seams between control systems, historians, and the MES/enterprise layers [7]; the connectivity standards chapter takes this apart in detail. Strategically, ISPE's Pharma 4.0 operating model frames this as a digital-maturity journey: converging IT and OT (the business and operational technology worlds), integrating architectures, and deliberately eliminating data silos rather than tolerating them [3]. This is the same integration imperative this book pursues: shared standards and ontologies so that a result in the LIMS and a tag in the historian can be recognized as describing the same real-world thing.

What does the fix look like in code? The companion open-source book makes the abstract batch_id real: the historian / time-series store and the LIMS and ELN layer each carry a batch_id column, and a contextualization step is what stitches the historian's sensor_reading, the lab's sample, and the batch-control batch_phase rows onto that one shared key — turning the manual bridges above into a relational join the database can perform on its own.
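Once every table carries the shared key, the cross-system question becomes an ordinary join. The sketch below uses an in-memory SQLite database as a stand-in. The table names follow the ones named above; every column other than batch_id, the batch identifiers, and all values are invented for illustration and are not the companion book's schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sensor_reading (batch_id TEXT, tag TEXT, value REAL);
CREATE TABLE batch_phase    (batch_id TEXT, phase TEXT);
CREATE TABLE sample         (batch_id TEXT, sample_id TEXT, purity_pct REAL);
""")
conn.executemany("INSERT INTO sensor_reading VALUES (?, ?, ?)", [
    ("B-0156", "BR-201.Temp.PV", 36.8),
    ("B-0156", "BR-201.Temp.PV", 37.0),
    ("B-0157", "BR-201.Temp.PV", 36.4),
    ("B-0157", "BR-201.Temp.PV", 36.6),
])
conn.executemany("INSERT INTO batch_phase VALUES (?, ?)", [
    ("B-0156", "production"),
    ("B-0157", "production"),
])
conn.executemany("INSERT INTO sample VALUES (?, ?, ?)", [
    ("B-0156", "S-0156-001", 98.5),
    ("B-0157", "S-0157-001", 97.1),
])

# "Which culture conditions gave the highest-purity batch?"
rows = conn.execute("""
    SELECT p.batch_id,
           ROUND(AVG(r.value), 2) AS mean_temp_c,
           s.purity_pct
    FROM batch_phase p
    JOIN sensor_reading r ON r.batch_id = p.batch_id
    JOIN sample s         ON s.batch_id = p.batch_id
    WHERE p.phase = 'production' AND r.tag = 'BR-201.Temp.PV'
    GROUP BY p.batch_id, s.purity_pct
    ORDER BY s.purity_pct DESC
""").fetchall()
conn.close()
```

The query is unremarkable, and that is the point: the hard part was getting the same batch_id into three tables that originate in three different systems.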

What the fix looks like in semantics: the shared key as a graph

A relational join solves the silo inside one plant's database; the ontology book solves it as a graph so the same identity survives across systems and across plants. The move is to make the batch a node and the five local names properties of that one node, then assert lineage as edges a machine can walk. The grammar is RDF (Resource Description Framework — the W3C model that records every fact as a subject-predicate-object triple), written here in Turtle (a compact text syntax for RDF) with bp: standing for the bioprocess vocabulary the classes-and-taxonomy chapter builds:

@prefix bp: <https://example.org/bioproc#> .

# One batch node; each system-local name becomes a property of it
# (the ELN's entry reference would hang off the same node in the same way).
bp:BATCH-2026-001  a bp:Batch ;
    bp:historianTagPrefix  "BR201_BATCH0156" ;   # the historian's name
    bp:mesBatchNumber      "BATCH_2024_0156" ;    # the MES's name
    bp:limsParentSample    "S-0156-001" ;          # the LIMS's name
    bp:erpLot              "LOT-22A-MABX" ;        # the ERP's name
    bp:derivedFrom         bp:WCB-CHO-001 .         # lineage, one walkable edge

Two ontology techniques then turn that node into something machine-checkable. First, PROV-O (the W3C provenance ontology) is the standard way to record where a result came from: a LIMS purity result is modeled as prov:wasDerivedFrom the batch and prov:wasGeneratedBy the assay activity, so genealogy is carried as data rather than asserted by hand — the same derivedFrom spine the relations-and-genealogy chapter makes transitive, so a reasoner walks from a finished vial back to the cell bank through hops no one stated. Second, a SHACL (Shapes Constraint Language) shape can gate the integration: where SQL NOT NULL only guards one table, a SHACL shape validates the assembled graph in a closed world, where a missing required link is a failure now, not an open question — exactly the release-gate discipline Book 4 uses for a lot's CQA panel. A reconciliation shape would demand that every bp:Batch carry exactly one of each system's identifier and a non-empty bp:derivedFrom:

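@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix bp: <https://example.org/bioproc#> .

# A batch is not integration-ready until every system's name is present and lineage is asserted.
bp:BatchReconciliationShape a sh:NodeShape ;
    sh:targetClass bp:Batch ;
    sh:property [ sh:path bp:historianTagPrefix ; sh:minCount 1 ; sh:maxCount 1 ;
                  sh:message "Batch must carry exactly one historian tag prefix." ] ;
    sh:property [ sh:path bp:mesBatchNumber ;   sh:minCount 1 ; sh:maxCount 1 ;
                  sh:message "Batch must carry exactly one MES batch number." ] ;
    sh:property [ sh:path bp:limsParentSample ; sh:minCount 1 ; sh:maxCount 1 ;
                  sh:message "Batch must carry exactly one LIMS parent sample; otherwise the lab seam is broken." ] ;
    sh:property [ sh:path bp:erpLot ;           sh:minCount 1 ; sh:maxCount 1 ;
                  sh:message "Batch must carry exactly one ERP lot." ] ;
    sh:property [ sh:path bp:derivedFrom ;      sh:minCount 1 ;
                  sh:message "Batch has no lineage edge; genealogy is unproven." ] .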
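The same gate can be prototyped without an RDF toolchain, which is sometimes useful for testing the rule before committing it to a shape. The sketch below restates the rule in plain Python over a dictionary; it is an analogy to the SHACL shape, not a SHACL engine, and the dictionary layout is an assumption.

```python
# Properties that must appear exactly once on every batch.
REQUIRED_SINGLE = ("historianTagPrefix", "mesBatchNumber", "limsParentSample", "erpLot")


def reconciliation_failures(batch: dict) -> list:
    """Return one message per violated constraint; an empty list means ready."""
    failures = []
    for prop in REQUIRED_SINGLE:
        count = len(batch.get(prop, []))
        if count != 1:
            failures.append(f"{prop}: expected exactly 1 value, found {count}")
    if not batch.get("derivedFrom"):
        failures.append("derivedFrom: no lineage edge asserted")
    return failures


# Each property maps to a list of values, mirroring RDF's multi-valued properties.
complete = {
    "historianTagPrefix": ["BR201_BATCH0156"],
    "mesBatchNumber": ["BATCH_2024_0156"],
    "limsParentSample": ["S-0156-001"],
    "erpLot": ["LOT-22A-MABX"],
    "derivedFrom": ["bp:WCB-CHO-001"],
}
broken = {key: value for key, value in complete.items() if key != "erpLot"}

ready = reconciliation_failures(complete)
not_ready = reconciliation_failures(broken)
```

As in the closed-world reading described above, an absent property counts as a failure now rather than as something that might be filled in later.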

A regulator's traceability request — "show me everything that made this vial" — is then a SPARQL (the standard query language for RDF) competency question (a question the model must be able to answer), run as a transitive walk rather than a human's recollection:

# CQ: every material the finished lot derives from, to any depth.
PREFIX bp: <https://example.org/bioproc#>
SELECT DISTINCT ?ancestor WHERE {
  bp:BATCH-2026-001 (bp:derivedFrom)+ ?ancestor .
}
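What the `(bp:derivedFrom)+` property path asks the query engine to do is a transitive walk over lineage edges. A plain-Python sketch of that walk follows. The first edge is the one asserted in the Turtle above; the master cell bank node is hypothetical, added only so the walk has a second hop.

```python
# Lineage edges: node -> the nodes it is directly derived from.
DERIVED_FROM = {
    "bp:BATCH-2026-001": ["bp:WCB-CHO-001"],   # edge asserted in the text
    "bp:WCB-CHO-001": ["bp:MCB-CHO-001"],      # hypothetical second hop
}


def ancestors(node: str, edges: dict) -> set:
    """Every node reachable through one or more derivedFrom hops."""
    seen = set()
    stack = list(edges.get(node, []))
    while stack:
        current = stack.pop()
        if current in seen:
            continue            # already visited; also guards against cycles
        seen.add(current)
        stack.extend(edges.get(current, []))
    return seen


lineage = ancestors("bp:BATCH-2026-001", DERIVED_FROM)
```

The second hop was never stated about the batch itself; it falls out of walking the edges, which is the sense in which genealogy is carried as data.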

This is the bridge the rest of the series formalizes — the shared standards and ontologies chapter on the data side, and Book 4's specification-and-ORSD where these competency questions become executable acceptance tests. The point for this chapter is narrow: the missing shared key is not just a database column, it is a modeling decision, and once the five names hang off one node the silo becomes a query.

The unsolved challenge: batch-identifier reconciliation across system boundaries

It would be comforting to end by saying OPC UA and Pharma 4.0 have solved this. They have not. What remains genuinely hard is automatic batch-identifier reconciliation: there is still no industry-wide mechanism that, given the historian's BR201_BATCH0156, automatically resolves the matching MES BATCH_2024_0156, the LIMS parent sample S-0156-001, and the ERP lot LOT-22A-MABX without a human asserting the links. OPC UA can carry meaning across a connection, and ISA-95 can describe who owns what, but neither mints a shared key across systems that were never designed to share one.

The consequence is that FAIR interoperability and end-to-end batch traceability are, in most plants today, achieved by manual reconciliation rather than by design — and manual reconciliation is exactly the failure mode that regulators scrutinize. FDA data-integrity guidance expects records to remain attributable and reconstructable across their whole lifecycle, which is hardest to demonstrate precisely at these boundaries [5]. And the digital-twins literature is explicit that an integrated data layer — not more sensors, not more storage — is the missing prerequisite; the data already exists, but it is not connected [9]. Closing this gap is less a technology problem than a discipline problem: it requires master-data governance to designate a canonical batch identifier and bind every system's local name to it, which is why this book keeps returning to governance, semantics, and shared ontologies rather than to any single product. This same hand-off — the physical batch record reconciled into a regulator-ready, traceable whole — is the data backbone behind the manufacturing book's quality, regulatory, and data chapter.
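
The governance step described above, designating a canonical batch identifier and binding every system's local name to it, can be sketched as a small cross-reference registry. This is a hedged illustration rather than a product design: the class BatchIdRegistry, the canonical ID format CANON-0156, the system labels, and the asserted_by values are all assumptions, and the local names are the example identifiers used earlier in this section. Note that the sketch does not solve automatic reconciliation; a person still asserts each link, and the registry only records who did so and refuses contradictory bindings.

```python
class BatchIdRegistry:
    """Binds each system's local batch name to one canonical batch identifier."""

    def __init__(self):
        # (system, local_id) -> (canonical_id, asserted_by)
        self._bindings = {}

    def bind(self, canonical_id, system, local_id, asserted_by):
        """Record a link, rejecting any attempt to rebind a name to another batch."""
        key = (system, local_id)
        current = self._bindings.get(key)
        if current is not None and current[0] != canonical_id:
            raise ValueError(
                f"{system}:{local_id} is already bound to {current[0]}"
            )
        self._bindings[key] = (canonical_id, asserted_by)

    def resolve(self, system, local_id):
        """Return the canonical ID for a local name, or None if it is unbound."""
        binding = self._bindings.get((system, local_id))
        return None if binding is None else binding[0]

    def local_names(self, canonical_id):
        """Return the sorted (system, local_id) pairs bound to one canonical ID."""
        return sorted(
            key for key, (cid, _) in self._bindings.items() if cid == canonical_id
        )


registry = BatchIdRegistry()
registry.bind("CANON-0156", "historian", "BR201_BATCH0156", "analyst.a")
registry.bind("CANON-0156", "mes", "BATCH_2024_0156", "analyst.a")
registry.bind("CANON-0156", "lims", "S-0156-001", "qc.reviewer")
registry.bind("CANON-0156", "erp", "LOT-22A-MABX", "planner.b")

print(registry.resolve("historian", "BR201_BATCH0156"))
print(registry.local_names("CANON-0156"))
```

Keeping asserted_by next to every binding is a deliberate choice: it makes the reconciliation itself an attributable record instead of an undocumented spreadsheet lookup.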

Why it matters

The single most useful thing to carry out of this chapter is that there is no "plant database." There is a federation of systems of record — historian, MES, LIMS, ELN, ERP — each genuinely authoritative for its own slice and genuinely separate from the rest. That is not a flaw to be engineered away; it is the structure of the industry, and every one of those systems exists separately for good reasons of scale, regulation, and vendor history. The lesson for data management is where the value and the risk actually live: not inside any one box, but at the seams between them. A plant can be drowning in correctly captured, fully compliant data and still be unable to answer "which culture conditions gave the highest-purity batch?" — because the historian, the MES, and the LIMS each hold one third of the answer under three different names for the same batch. Knowing this reframes the whole job: the work that pays off is rarely buying a better historian; it is designing, governing, and maintaining the joins — the shared identity that lets the five names be recognized as one batch. Every later chapter on architecture, governance, semantics, and ontologies is, at bottom, an attempt to build that join by design rather than by hand.

In the real world

This federation is not a textbook abstraction — it is the literal vendor list of a working biologics plant. A typical facility runs an AVEVA PI (OSIsoft) or GE Proficy historian, a Körber Werum PAS-X or Siemens Opcenter MES/EBR, a LabVantage or Thermo SampleManager LIMS, a Benchling or IDBS ELN, an SAP ERP, and a Waters Empower CDS — six or more systems from six different vendors, each validated under GAMP 5 [6], each speaking its own dialect. Stitching them together is, in most plants today, a job done by people and spreadsheets: an analyst pastes a historian tag prefix into a search box, an operator transcribes a batch number, a reviewer reconciles a sample ID against a lot by hand. Two places make this concrete. First, review by exception is the routine that keeps batch release tractable — the EBR auto-confirms the thousands of in-limit steps and surfaces only the handful of flagged exceptions a human must judge, and a lot cannot be dispositioned until every one is closed. Second, inspection is where the seams are stress-tested: an FDA or EMA inspector asks you to trace a specific finished vial back to the bioreactor run, the media lots, and the QC methods that made it, and the speed and credibility of that answer depend entirely on whether the genealogy was carried as data or lives in someone's memory. The plants that have invested in master-data governance and a shared batch identity answer in minutes; the rest answer in days of manual reconciliation — and the difference is exactly the integration discipline this book is about.
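
The review-by-exception routine described above reduces to a simple triage rule, sketched here in Python. The Step fields, the function names, and the sample values are hypothetical and chosen only for illustration; a real EBR applies validated parametric limits from the master recipe and carries far more context per step.

```python
from dataclasses import dataclass


@dataclass
class Step:
    name: str
    value: float
    low: float    # lower parametric limit
    high: float   # upper parametric limit
    exception_closed: bool = False  # set once a human has judged the exception


def flagged_exceptions(steps):
    """Return only the steps whose value falls outside its limits."""
    return [s for s in steps if not (s.low <= s.value <= s.high)]


def can_disposition(steps):
    """A lot can be dispositioned only when every flagged exception is closed."""
    return all(s.exception_closed for s in flagged_exceptions(steps))


# Illustrative record: two in-limit steps are auto-confirmed, one is flagged.
record = [
    Step("inoculation_temp_C", 36.9, 36.5, 37.5),
    Step("culture_pH", 7.05, 6.90, 7.20),
    Step("harvest_hold_h", 9.5, 0.0, 8.0),
]

print([s.name for s in flagged_exceptions(record)])
print(can_disposition(record))
```

With no flagged steps, can_disposition returns True immediately, which is the whole appeal of the approach: reviewer attention scales with the number of exceptions, not with the length of the record.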

Key terms

Process historian — a database optimized for storing high-volume time-series data, using compression to retain the signal cheaply [5].
Tag — a named channel in the historian for one measured point, producing many timestamped values.
Time-series data — measurements indexed by the exact time they were taken.
MES (Manufacturing Execution System) — the manufacturing-operations-level system that governs batch execution [1]; the recipe it enforces is structured by the ISA-88 batch standard.
EBR (electronic batch record) — the digital, signed record of how a specific batch was made.
Master recipe — the approved template defining how a product is manufactured.
Review by exception — the EBR auto-confirms every in-tolerance step against its parametric limits and surfaces only the out-of-tolerance steps for human review, so attention goes to the flagged exceptions rather than to every compliant entry.
Predicate rule — the underlying GMP regulation (for a batch record, 21 CFR Part 211 §§211.186/211.188) that requires a record to exist; 21 CFR Part 11 governs that record once it is electronic [12].
Managed vs. unplanned deviation — a deviation anticipated and governed by an existing procedure (managed/planned) versus one that was not, which opens a formal deviation record and, where warranted, a CAPA.
CAPA (corrective and preventive action) — the formal record opened to investigate and remediate an unplanned deviation; it must be closed before the lot can be released.
Batch disposition — the Quality Unit's release/reject/hold decision on a lot, gated on every flagged exception being closed.
System of record — the single authoritative source for a given slice of data.
LIMS (Laboratory Information Management System) — the system that tracks samples, tests, specs, and QC results.
ELN (Electronic Lab Notebook) — the digital notebook capturing exploratory experiments and reasoning.
CDS (Chromatography Data System) — software that acquires and processes chromatography data [5].
BMS / EMS — Building / Environmental Monitoring Systems that record cleanroom conditions.
ERP (Enterprise Resource Planning) — the enterprise-level system for materials, inventory, and orders.
GxP — the umbrella for the "good practice" quality regulations (Good Manufacturing/Laboratory/Clinical Practice and kin) that govern any inspectable record.
TSDB (time-series database) — a database purpose-built for high-rate timestamped data; open-source TSDBs (e.g., TimescaleDB) can serve as the historian engine.
Audit trail — a secure, time-stamped record of who changed what and when, required by 21 CFR Part 11 [4].
ALCOA+ — data should be Attributable, Legible, Contemporaneous, Original, Accurate, plus Complete, Consistent, Enduring, and Available [5].
Data silo — valuable data trapped in a system that can't easily share it; the opposite of FAIR [2].
OPC UA — a vendor-neutral standard for exchanging data and its meaning across system boundaries [7].
Batch-identifier reconciliation — linking the several names one batch carries across the historian, MES, LIMS, ELN, and ERP back to a single shared identity; still largely manual today.
Integrated data layer — a connected layer that resolves every system's local identifiers onto one shared key, named in the digital-twins literature as the prerequisite for breaking silos [9].
RDF / Turtle / triple — the W3C model that records every fact as a subject-predicate-object triple (Turtle is its compact text syntax), used to make a batch a node whose five system-local names are properties of it.
PROV-O — the W3C provenance ontology; prov:wasDerivedFrom / prov:wasGeneratedBy carry a result's genealogy as data rather than as a human's assertion.
SHACL (Shapes Constraint Language) — a language whose shapes validate the assembled graph in a closed world, where a missing required link (e.g., an absent lineage edge) is a failure now, not an open question.
SPARQL competency question — a question the model must answer, run as a query over the RDF graph; a transitive derivedFrom walk answers a regulator's traceability request automatically.
Grouped (leave-one-batch-out) cross-validation — holding whole batches, not individual rows, out of training, so a model's score measures generalization to an unseen batch; impossible without a trustworthy shared batch_id.
Data leakage — information from the evaluation set bleeding into training (e.g., a batch split across train and test), which inflates a model's reported score above what production will deliver.
Applicability domain — a check of whether a new input lies inside the envelope of conditions a model was trained on, so an out-of-range input is flagged rather than silently extrapolated.
Drift detection — watching a deployed model's inputs or errors wander over time, distinguishing genuine process drift from a data-pipeline break at a system seam.
Pooling-window cut points — the two thresholds, set against a Protein A column's live UV280 trace, between which the eluted antibody is collected as product and outside which it goes to waste.

Where this leads

We now have the cast of systems and a hard look at the seams between them. The next chapter, Architecture and Integration: ISA-95, OT/IT, and the Edge-to-Cloud Path, gives the constellation a map: the ISA-95 / Purdue hierarchy that organizes everything from Level 0 sensors to Level 4 enterprise, the convergence of operational and information technology, the contextualization layer that finally lets these systems agree on what they mean, and the modern path that carries plant data from the edge all the way to the cloud.
