Connecting Legacy & Commercial Skids: Modbus, Siemens S7, PLC4X

๐Ÿ“ Where we are: Part II ยท Capturing the Process โ€” Chapter 9. The clean OPC UA bioreactor of Chapters 5โ€“7 is the exception, not the rule. This chapter reaches the other equipment on the floor โ€” the harvest centrifuge, the TFF skid, the balance โ€” that speak older, insecure protocols, and reads them safely from behind OT segmentation into the same tag namespace.

The simple version

OPC UA is a polite, self-describing protocol: ask a server "what do you have?" and it tells you, with names and units and security. Modbus is the opposite. A Modbus device is a numbered cupboard of 16-bit pigeonholes. Pigeonhole 1 holds 1850. That is all it tells you. Whether 1850 means 1.850 bar, 1850 millibar, or 18.50 of something is written nowhere on the wire; it lives in a PDF datasheet on someone's desk. There is no password on the cupboard and no lock on the door. So we do two things: we put the cupboard inside a guarded room (network segmentation), and we keep our own labelled key-ring (the tag dictionary) that says "pigeonhole 1, multiply by 0.001, call it transmembrane pressure in bar." This chapter builds that key-ring in real, runnable code.

What this chapter covers

By now we can capture a modern bioreactor cleanly. But walk any real biomanufacturing floor and most of the equipment predates OPC UA adoption: a centrifuge from a 2009 install, a tangential-flow-filtration (TFF) skid whose PLC was specified before "cybersecurity" was a procurement line item, a bench balance with a serial port. These speak Modbus and Siemens S7, protocols designed for a trusted wire, with no authentication and no encryption.

We will:

  • read a (simulated) TFF skid over Modbus TCP with PyModbus, and scale its raw integer registers into the same engineering-unit tags the rest of the platform uses;
  • face the brutal truth that Modbus and S7 carry no security at all, and what the Siemens S7-1200/1500 PUT/GET caveat really means;
  • treat OT network segmentation as the compensating control that lets us touch these devices at all, and write it down as a risk-based decision;
  • and look at how python-snap7 and Apache PLC4X extend the same pattern to Siemens PLCs and mixed legacy fleets.

The runnable code in this chapter is examples/chapters/09-legacy-skids-modbus-s7/modbus_reader.py. The S7 and PLC4X material is shown as realistic, clearly-labelled configuration; there is no Siemens PLC inside a laptop, and we will be honest about exactly where the simulation stops.

Why legacy protocols are a different animal

Modbus was published in 1979 and standardized as a simple request/reply, client/server messaging protocol: a client sends a function code (read holding registers, write a coil) and an address, and the server replies with 16-bit register values or single-bit coils [1]. There is nowhere in the frame for a username, a password, a signature, or a session key. None. A device on the wire does whatever any client asks. That is not a bug you can patch; it is the protocol.
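To make "nowhere in the frame" concrete, here is a minimal sketch that builds a Modbus TCP read-holding-registers request by hand. The helper name `build_read_holding_request` is hypothetical and exists only for illustration; the byte layout follows the standard MBAP header plus function code 0x03.

```python
import struct


def build_read_holding_request(transaction_id: int, unit_id: int,
                               address: int, count: int) -> bytes:
    """Build a Modbus TCP 'read holding registers' (function 0x03) request."""
    # PDU: function code (1 byte), starting address (2 bytes), quantity (2 bytes).
    pdu = struct.pack(">BHH", 0x03, address, count)
    # MBAP header: transaction id, protocol id (always 0), length, unit id.
    # The length field counts the unit id byte plus the PDU.
    mbap = struct.pack(">HHHB", transaction_id, 0, len(pdu) + 1, unit_id)
    return mbap + pdu


frame = build_read_holding_request(transaction_id=1, unit_id=1, address=0, count=4)
print(frame.hex(" "))
# prints: 00 01 00 00 00 06 01 03 00 00 00 04
```

Twelve bytes, every one of them accounted for: bookkeeping, a unit ID, a function code, an address, and a count. No field identifies or authenticates the caller.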

The Modbus Organization eventually acknowledged this and published a Modbus/TCP Security specification that wraps the protocol in TLS with X.509 client certificates, but it is a later, optional variant, and the skids we are reaching in this chapter predate it and will never support it [2]. So the honest engineering position is: we cannot make the protocol secure, so we make the network around it secure, and we read the device through a validated edge layer that owns the meaning the device itself does not carry.

Siemens S7comm (and its newer S7CommPlus) is the same story in a different dialect. It rides a TPKT/COTP/S7comm stack over TCP and, like Modbus, was built for a trusted automation network. Security researchers have demonstrated that S7CommPlus's integrity mechanism can be defeated with replay and injection attacks; its "anti-replay" is not robust authentication [9]. The practical takeaway for a data engineer is identical to Modbus: do not rely on the protocol to protect itself; segment, and read it under a controlled layer.

Reading a Modbus skid for real

Here is the device, exactly as a legacy PLC presents it. From examples/chapters/09-legacy-skids-modbus-s7/modbus_reader.py:

# Raw holding registers the skid PLC exposes (scaled integers, as legacy PLCs do).
# Modbus holding registers are conventionally numbered 40001..; these four sit at
# zero-based wire addresses 0-3 (documentation numbers 40001-40004).
TFF_RAW = [1850, 320, 1240, 78]  # TMP, flux, conductivity, recovery
# position -> (tag, scale, unit): the normalization the edge gateway applies
SCALING = [
    ("TFF01.TMP.PV", 0.001, "bar"),      # 1850 -> 1.850 bar
    ("TFF01.Flux.PV", 0.1, "LMH"),       # 320 -> 32.0 LMH
    ("TFF01.Cond.PV", 0.01, "mS/cm"),    # 1240 -> 12.40 mS/cm
    ("TFF01.Recovery.PV", 1.0, "%"),     # 78 -> 78 %
]

Stare at TFF_RAW for a moment, because it is the whole problem in four integers. 1850 is not 1.850 bar to the device; it is just the number 1850. Legacy PLCs almost never store floating-point engineering units; memory and fieldbus bandwidth were precious, so values are stored as scaled integers: pressure ×1000, flux ×10, conductivity ×100. The scale factor is a convention agreed in a register map, not anything the protocol announces. If the SCALING table is wrong by one decimal place, your transmembrane pressure reads 18.5 bar instead of 1.85, and nothing on the wire will tell you.
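Since the wire cannot catch a misplaced decimal, one option is a plausibility check at the edge. The sketch below is only an illustration: the `PLAUSIBLE` limits and the `check_plausible` helper are hypothetical, and the 0 to 5 bar window is an assumed value, not taken from any datasheet.

```python
# Hypothetical plausibility limits per tag (assumed values, not from a datasheet).
PLAUSIBLE = {"TFF01.TMP.PV": (0.0, 5.0)}  # bar


def check_plausible(tag: str, value: float) -> bool:
    """Return True if the scaled value falls inside the tag's plausible window."""
    low, high = PLAUSIBLE[tag]
    return low <= value <= high


raw = 1850
correct = raw * 0.001   # about 1.85 bar, the intended scaling
slipped = raw * 0.01    # about 18.5 bar, scale factor off by one decimal place

print(check_plausible("TFF01.TMP.PV", correct))  # True
print(check_plausible("TFF01.TMP.PV", slipped))  # False
```

A range check like this catches a decade error only when the wrong value leaves the window, so it supplements review of the scaling table and does not replace it.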

That SCALING table is the chapter's quiet hero. It is the bridge from a raw register to the UNS tag namespace we designed in Chapter 4: TFF01.TMP.PV is the canonical name for the tangential-flow-filtration skid's transmembrane-pressure process value, in bar. The legacy device knows none of that; the edge layer supplies it. In production this exact mapping lives in gov.tag_dictionary, generated and linted in Chapter 4, so register-to-tag scaling is reviewed configuration, not a magic number buried in a script.

The scaling itself is a five-line function, deliberately dumb, easy to test, and easy to validate:

def scale(registers: list[int]) -> dict[str, dict]:
    out = {}
    for value, (tag, factor, unit) in zip(registers, SCALING):
        out[tag] = {"value": round(value * factor, 3), "unit": unit}
    return out
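"Easy to test" can be shown directly. The following is a minimal sketch of a test for the function above; it repeats SCALING and scale() so it runs on its own, and the test function name is hypothetical, not part of the chapter's repository.

```python
SCALING = [
    ("TFF01.TMP.PV", 0.001, "bar"),
    ("TFF01.Flux.PV", 0.1, "LMH"),
    ("TFF01.Cond.PV", 0.01, "mS/cm"),
    ("TFF01.Recovery.PV", 1.0, "%"),
]


def scale(registers: list[int]) -> dict[str, dict]:
    out = {}
    for value, (tag, factor, unit) in zip(registers, SCALING):
        out[tag] = {"value": round(value * factor, 3), "unit": unit}
    return out


def test_scale_known_snapshot() -> None:
    result = scale([1850, 320, 1240, 78])
    assert result["TFF01.TMP.PV"] == {"value": 1.85, "unit": "bar"}
    assert result["TFF01.Flux.PV"] == {"value": 32.0, "unit": "LMH"}
    assert result["TFF01.Cond.PV"] == {"value": 12.4, "unit": "mS/cm"}
    assert result["TFF01.Recovery.PV"] == {"value": 78.0, "unit": "%"}
    # zip() stops at the shorter input, so a short read silently drops tags.
    assert list(scale([1850])) == ["TFF01.TMP.PV"]


test_scale_known_snapshot()
```

The last assertion documents a behaviour worth knowing: because zip() truncates, a caller that wants a short read to fail loudly would need to compare lengths before scaling.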

And here is the part that talks to a real device: the actual PyModbus client call an edge collector makes. PyModbus is a full open-source Modbus client/server for both TCP and serial RTU/ASCII, which is why it is the tool of choice for reaching these skids from a validated edge layer [3]:

async def read_skid(host: str = "127.0.0.1", port: int = 502, unit: int = 1) -> dict:
    """Read a real TFF skid over Modbus TCP and normalize to engineering units.

    This is the actual pymodbus client call an edge collector makes; point it at
    a real skid (or the repo's mock) by host/port. Holding registers are read
    starting at address 0 (40001); confirm your device's base with its map.
    """
    from pymodbus.client import AsyncModbusTcpClient

    client = AsyncModbusTcpClient(host, port=port)
    await client.connect()
    rr = await client.read_holding_registers(address=0, count=len(TFF_RAW), device_id=unit)
    client.close()
    if rr.isError():
        raise OSError(f"Modbus read failed: {rr}")
    return scale(list(rr.registers))

Three details in that short function are worth your attention, because each is a classic legacy-integration trap.

The address=0 is the notorious off-by-one of the Modbus world. Holding registers are documented as 40001, 40002, 40003… but the actual on-wire protocol address is zero-based, so register "40001" is read at address 0. Vendors disagree about whether their datasheet means the documentation number or the wire number, so the note in the code ("confirm your device's base with its map") is not boilerplate; it is the single most common reason a first read returns garbage.
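The conversion is trivial, which is exactly why it gets skipped. A minimal sketch, assuming the datasheet uses the five-digit 4xxxx documentation convention (some vendors use six digits or list wire addresses directly); the helper name `wire_address` is hypothetical.

```python
def wire_address(doc_number: int) -> int:
    """Convert a 4xxxx-style holding-register number to a zero-based wire address.

    Assumes the five-digit documentation convention (40001..49999).
    """
    if not 40001 <= doc_number <= 49999:
        raise ValueError(f"{doc_number} is not a 4xxxx holding-register number")
    return doc_number - 40001


print(wire_address(40001))  # 0
print(wire_address(40004))  # 3
```

Keeping the conversion in one named function, instead of scattering "minus one" across the collector, gives a reviewer a single place to confirm which convention the register map uses.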

The device_id=unit (the Modbus "unit ID" or slave address) matters because one Modbus/TCP gateway often fronts several serial devices daisy-chained behind it; the unit ID picks which physical box answers. And the explicit rr.isError() check exists because Modbus has no quality flag the way OPC UA does (recall the Good/Uncertain/Bad status codes from Chapter 7). A Modbus read either returns numbers or returns an exception response, and it is on us, the edge layer, to turn that into something the historian can trust. Honesty about data quality is something we have to add; the protocol will not.
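One way the edge layer might add that honesty is to stamp every reading with a quality and a collector-side timestamp. This is a sketch under assumptions: the `Reading` record and `to_readings` helper are hypothetical names, and the two-state Good/Bad quality is a simplification of the OPC UA vocabulary, not the platform's actual schema.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class Reading:
    tag: str
    value: Optional[float]
    unit: str
    quality: str   # "Good" or "Bad", borrowing the OPC UA vocabulary
    read_at: str   # collector-side UTC timestamp, ISO 8601


def to_readings(expected: dict, scaled: Optional[dict]) -> list:
    """Turn a scaled result (or None for a failed read) into stamped readings.

    expected maps tag -> unit; scaled has the shape returned by scale().
    """
    now = datetime.now(timezone.utc).isoformat()
    readings = []
    for tag, unit in expected.items():
        if scaled is not None and tag in scaled:
            readings.append(Reading(tag, scaled[tag]["value"], unit, "Good", now))
        else:
            # A failed or partial read is recorded explicitly, never silently dropped.
            readings.append(Reading(tag, None, unit, "Bad", now))
    return readings


expected = {"TFF01.TMP.PV": "bar"}
ok = to_readings(expected, {"TFF01.TMP.PV": {"value": 1.85, "unit": "bar"}})
failed = to_readings(expected, None)
print(ok[0].quality, failed[0].quality)  # Good Bad
```

Note that the timestamp is the collector's clock at read time, not the device's; Modbus offers no source timestamp to carry forward.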

Running it with no hardware

Because there is no TFF skid inside a laptop, the file ships a demo() that applies the same scaling to the known register snapshot, so the chapter is runnable end to end with zero hardware and zero network:

def demo() -> dict:
    """Apply the engineering-unit scaling to a known register snapshot (no network)."""
    return scale(TFF_RAW)


if __name__ == "__main__":
    for tag, v in demo().items():
        print(f" {tag:20} = {v['value']} {v['unit']}")

Run it:

$ python chapters/09-legacy-skids-modbus-s7/modbus_reader.py
TFF01.TMP.PV = 1.85 bar
TFF01.Flux.PV = 32.0 LMH
TFF01.Cond.PV = 12.4 mS/cm
TFF01.Recovery.PV = 78.0 %

Those four lines are the entire point of the chapter made concrete. Four meaningless integers went in; four named, scaled, unit-stamped readings came out, in exactly the ASSET.Measurement.PV shape the rest of the platform speaks. A transmembrane pressure of 1.85 bar and a flux of 32 LMH (litres per square metre per hour) are sensible numbers for a mAb TFF step, and now they can flow into ts.sensor_reading and be contextualized to a batch and phase like any OPC UA tag. Be honest about what this proves: it exercises the integration logic and the data shapes, not a specific vendor's Modbus quirks. The read_skid() path is the real client call; you point it at an actual skid (or the repo's Modbus mock) by host and port, and the same scale() runs on what comes back.

A legacy TFF skid PLC exposes four raw integer holding registers; an arrow carries them into an edge collector running modbus_reader.py, where a scaling table multiplies each by its factor and attaches a UNS tag name and unit, emitting four named engineering-unit readings into the historian. The whole legacy side sits inside a boxed, segmented OT zone.

From numbered pigeonholes to named records: a legacy Modbus skid carries only raw scaled integers, so the validated edge layer supplies the scale factor, unit, and canonical tag name that the protocol omits, all from behind OT segmentation. Original diagram by the authors, created with AI assistance.

Siemens S7 and the PUT/GET trap

Plenty of commercial skids are built on Siemens S7 PLCs rather than Modbus. The open-source door here is python-snap7, a pure-Python S7 library that implements the TPKT/COTP/S7comm/S7CommPlus stack and can read S7-300/400/1200/1500 controllers natively [4]. The pattern mirrors read_skid(): connect, read a chunk of a data block, then byte-decode and scale it into tags. The following is an illustrative snippet; there is no Siemens PLC on a laptop, so this is the shape of the call, not a tested run:

# Illustrative: requires a real/simulated S7 PLC; not run on a laptop.
import snap7
from snap7.util import get_int

client = snap7.client.Client()
client.connect("192.0.2.50", rack=0, slot=1)          # S7-1500: rack 0, slot 1
db = client.db_read(db_number=10, start=0, size=8)    # read 8 bytes of DB10
tmp_raw = get_int(db, 0)                              # offset 0 -> TMP scaled int
client.disconnect()
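The byte-decode step can be exercised without a PLC. The sketch below assumes a hypothetical layout in which DB10 holds four consecutive S7 INT values (16-bit signed, big-endian) carrying the same scaled integers as the Modbus example; the `db` bytes are hand-built stand-ins for what a data-block read would return.

```python
import struct

# Hypothetical 8-byte payload: four big-endian signed 16-bit integers,
# as a read of four consecutive S7 INTs would deliver them.
db = bytes([0x07, 0x3A, 0x01, 0x40, 0x04, 0xD8, 0x00, 0x4E])

# ">4h" = big-endian, four signed 16-bit values (the S7 INT wire format).
raw_values = struct.unpack(">4h", db)
print(raw_values)  # (1850, 320, 1240, 78)

# Offset 0 holds the first INT; each further INT sits two bytes along.
tmp_raw = struct.unpack_from(">h", db, 0)[0]
print(round(tmp_raw * 0.001, 3))  # 1.85
```

Decoding little-endian by mistake would turn 0x073A into 0x3A07 (14855), another error that nothing on the wire would flag.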

But there is a specific, infamous gotcha you must know before you wire an S7 PLC into anything. On modern S7-1200 and S7-1500 controllers, snap7's classic S7comm data-block reads depend on two TIA Portal settings: the PLC's PUT/GET communication must be permitted, and the data block must not be marked "optimized block access". If PUT/GET is disabled (it is off by default on these families for safety), your reads fail outright; if an automation engineer turns it on so your collector can read, they have just opened a door that, combined with S7comm's weak authentication, lets any client that can reach the PLC read and write it [9]. Enabling that single checkbox must be a documented, risk-based decision, not a convenience toggle: turning it on to feed a historian is exactly the kind of trade-off that belongs in a change record and a network-segmentation justification, never something flipped quietly in the field.

One library for a mixed fleet: Apache PLC4X

A real plant is rarely one protocol. You will meet Modbus and S7 and Allen-Bradley in the same harvest suite, and writing a bespoke client for each is how integration projects rot. Apache PLC4X offers a single, shared API behind per-protocol drivers, so the same connection-string-and-read code reaches a Siemens S7 over TCP or a Modbus device over TCP/RTU/ASCII without your application caring which [5]. The Modbus driver addresses coils, discrete inputs, holding registers, and input registers under that common API; the S7 driver speaks to the S7-300/400/1200/1500 line [6].

In practice you express the fleet as configuration. The block below is an illustrative PLC4X-style connection map for our two legacy assets (the kind of edge/plc4x/plc4x-connect.yaml an edge service would consume), not a tested artifact in this chapter's directory:

# Illustrative PLC4X connection map (not a tested artifact in this chapter dir).
connections:
  tff01:
    url: "modbus-tcp://10.20.0.11:502?unit-identifier=1"
    poll_ms: 1000
    tags:
      TFF01.TMP.PV: { address: "holding-register:1:INT", scale: 0.001, unit: "bar" }
      TFF01.Flux.PV: { address: "holding-register:2:INT", scale: 0.1, unit: "LMH" }
  centrifuge01:
    url: "s7://10.20.0.21?remote-rack=0&remote-slot=1"
    poll_ms: 2000
    tags:
      CFG01.Speed.PV: { address: "%DB10.DBW0:INT", scale: 1.0, unit: "rpm" }

Notice that the scale and unit keys reappear, now per protocol. The legacy meaning problem never goes away; PLC4X just gives you one consistent place to keep the key-ring. The honest trade-off: PLC4X is a powerful, Apache-licensed Java/Go project, but it is heavier than a 60-line PyModbus script and its protocol-driver maturity varies by device. For one Modbus skid, PyModbus is right. For a fleet of mixed legacy controllers feeding one edge service, PLC4X earns its weight.
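The "one key-ring" idea can be sketched independently of any driver. The code below is an assumption-laden illustration: `CONNECTIONS` is a Python dict mirroring the scale and unit keys of the map above, `normalize` is a hypothetical helper, and the raw values stand in for whatever a Modbus or S7 driver would return.

```python
# Mirrors the scale/unit keys of the illustrative connection map (addresses omitted).
CONNECTIONS = {
    "tff01": {
        "tags": {
            "TFF01.TMP.PV": {"scale": 0.001, "unit": "bar"},
            "TFF01.Flux.PV": {"scale": 0.1, "unit": "LMH"},
        }
    },
    "centrifuge01": {
        "tags": {
            "CFG01.Speed.PV": {"scale": 1.0, "unit": "rpm"},
        }
    },
}


def normalize(connection: str, raw: dict) -> dict:
    """Apply the per-tag scale and unit, whichever protocol produced the raw value."""
    tags = CONNECTIONS[connection]["tags"]
    out = {}
    for tag, raw_value in raw.items():
        spec = tags[tag]  # KeyError if a driver returns a tag the map does not define
        out[tag] = {"value": round(raw_value * spec["scale"], 3), "unit": spec["unit"]}
    return out


print(normalize("tff01", {"TFF01.TMP.PV": 1850}))
# {'TFF01.TMP.PV': {'value': 1.85, 'unit': 'bar'}}
print(normalize("centrifuge01", {"CFG01.Speed.PV": 7200}))  # 7200 is a made-up raw value
# {'CFG01.Speed.PV': {'value': 7200.0, 'unit': 'rpm'}}
```

The normalization code never learns whether the integer arrived over Modbus or S7, which is the property that makes a single reviewed mapping workable for a mixed fleet.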

Why it matters

Legacy integration is where data-integrity ambitions meet the actual floor. Every ALCOA+ attribute we have championed (attributable, accurate, contemporaneous) has to survive a protocol that volunteers none of them. Modbus will not tell you the unit, will not timestamp the value at source, and will not flag bad data. If the edge layer scales 1850 wrong, the historian faithfully and permanently records an accurate-looking but wrong transmembrane pressure, and a process-validation reviewer downstream has no way to see the error. The scaling table is therefore not plumbing; it is a data-integrity control, and it deserves the review, version control, and qualification any GMP control gets.

It also matters because you usually cannot rip-and-replace this equipment. A qualified TFF skid or centrifuge represents years of validation; "we'll buy OPC UA-native gear" is rarely a real option for an existing line. So the data engineer's job is not to wish the legacy protocol away; it is to read it honestly and wrap it in the controls the protocol lacks.

In the real world

The blunt reality is that insecure legacy protocols are everywhere in pharma, and the regulatory and security frameworks already expect you to compensate at the network layer. NIST SP 800-82 Rev. 3, the authoritative OT-security guide, is built around exactly this: zone-and-conduit architectures, network segmentation, and compensating controls for protocols that cannot defend themselves [7]. IEC 62443-3-3 makes it a requirement, not advice: its foundational requirement FR5 (Restricted Data Flow), including SR 5.1 Network Segmentation, mandates segmenting insecure OT by security level into zones connected only through controlled conduits [8]. So when our read_skid() reaches a Modbus device, it does so from inside a defined conduit: the edge collector sits in a controlled zone, the skid sits in an OT zone, and the only traffic between them is the specific Modbus read on the specific port we documented.
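A documented conduit is, in effect, an allow-list. The sketch below models that idea in a few lines; it is illustrative only, since real enforcement lives in firewalls or protocol-aware OT gateways and not in collector code. The `Flow` record, the zone name, and the address are hypothetical.

```python
from typing import NamedTuple


class Flow(NamedTuple):
    src: str            # originating zone or host
    dst: str            # destination device
    port: int           # TCP port
    function_code: int  # Modbus function code


# The single documented conduit: the edge collector may read holding
# registers (function 0x03) from the TFF skid on the Modbus TCP port.
ALLOWED = {Flow("edge-collector", "10.20.0.11", 502, 0x03)}


def is_permitted(flow: Flow) -> bool:
    """Default deny: only flows written into the allow-list pass."""
    return flow in ALLOWED


read = Flow("edge-collector", "10.20.0.11", 502, 0x03)
write = Flow("edge-collector", "10.20.0.11", 502, 0x06)  # write single register
print(is_permitted(read), is_permitted(write))  # True False
```

Listing the function code alongside host and port matters here because the historian feed needs reads only; a write from the same source is a different flow and stays denied.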

Crucially, choosing segmentation as the answer is itself a documented, risk-based decision. The FDA's Computer Software Assurance guidance frames assurance for production and quality-system software around intended use and risk: you are expected to identify that the protocol is insecure, decide that network controls plus a validated edge layer are the proportionate mitigation, and write that reasoning down [10]. "We segmented the Modbus skids and read them through a qualified gateway because the protocol has no authentication" is precisely the kind of risk-based statement an inspector wants to see, and precisely what this chapter's code makes concrete.

Now the honest OSS-vs-commercial line. The reading is genuinely, completely solved in open source: PyModbus, python-snap7, and PLC4X will talk to almost any legacy controller, at no licence cost, with code you can read and test. What pure OSS does not give you is the validated-driver accountability a commercial historian's connector ships with (AVEVA PI's interfaces and connectors, Kepware/KEPServerEX, and the like come with vendor qualification packages and support contracts that name a throat to choke). With our PyModbus collector, you own proving the scaling is correct, that reads are reliable, and that the segmentation holds, which is the recurring shape of this book: open source reaches the device cleanly; the GxP wrapper around it is yours to build or buy. And no amount of either side changes the protocol: Modbus and S7 stay insecure, and the network is the only place that gets fixed.

Key terms

  • Modbus: a 1979 request/reply client/server protocol using function codes and 16-bit registers/coils, with no authentication or encryption; common on legacy skids, balances, and pumps.
  • Holding register: a 16-bit read/write memory slot in a Modbus device, conventionally numbered from 40001 but addressed from 0 on the wire; legacy PLCs store engineering values here as scaled integers.
  • Scaled integer: a value stored as a whole number times a fixed factor (e.g. pressure ×1000) because the device cannot or does not store floating point; the edge layer must apply the scale to recover engineering units.
  • Unit ID / device_id: the Modbus slave address that selects which physical device answers behind a shared TCP gateway.
  • Siemens S7comm / S7CommPlus: Siemens' proprietary PLC protocol stack (over TPKT/COTP/TCP); insecure by design, with weak authentication that has been defeated by replay/injection.
  • PUT/GET communication: a Siemens S7-1200/1500 setting that must be enabled (and optimized block access disabled) for external clients like snap7 to read data blocks; off by default, and enabling it widens the attack surface.
  • OT network segmentation: isolating operational-technology equipment into zones connected only through controlled conduits; the IEC 62443 / NIST SP 800-82 compensating control for insecure legacy protocols.
  • Zones and conduits: the IEC 62443 model of grouping assets by security level (zones) and permitting traffic only through defined, controlled paths (conduits).

Where this leads

We have reached the messy edges of the upstream world (the legacy skids that speak in numbered pigeonholes) and pulled them into the same tag namespace as everything else, from behind segmentation, with code you can run. Next we follow the product downstream. In Chapter 10, Downstream Capture: Chromatography & Filtration Skids, we capture the Protein A capture cycle and the filtration trains that polish and concentrate the antibody, turning a chromatogram's load/wash/elute/strip phases into events.operation_event rows: the moment the harvest becomes drug substance, and the moment the time-series we have been collecting starts to tell a purification story.