Insights — November 28, 2025

From Shop Floor to Cloud: Building a Real-Time Manufacturing Data Pipeline

A technical guide to streaming production data from PLCs and sensors all the way to cloud analytics.

Shop Floor to Cloud Data Pipeline

Building a real-time manufacturing data pipeline — getting live production data from PLCs and sensors on the factory floor into cloud analytics infrastructure — sounds straightforward on a whiteboard. In practice, it involves navigating a stack of protocols, hardware, networking, and security constraints that can turn a "simple" project into a multi-quarter initiative. This guide is for engineers and operations leaders who want to understand what the architecture actually looks like — and where the complexity lives.

The Stack, Layer by Layer

Layer 1: Data acquisition at the source. This is where most pipelines get complicated, because "the factory floor" is not a homogeneous data environment. A typical mid-size plant has PLCs speaking multiple protocols: older machines on Modbus RTU or Profibus, newer machines on EtherNet/IP or OPC-UA, and SCADA systems exposing proprietary historian interfaces. The data acquisition layer needs to speak all of these languages and normalize the output into a consistent format.

The two main approaches are: (1) native protocol connectors running on an edge gateway device co-located with the PLC, or (2) reading from an existing SCADA historian if one is in place. Option 1 gives you access to raw PLC data at high resolution (sub-second). Option 2 is faster to implement but is limited to whatever data the historian was already configured to capture, often at coarser resolution (1–5 second samples) and without process variables that weren't deemed historian-worthy when the historian was set up.

For AI applications, option 1 is almost always preferable. The patterns that predict failures often involve subtle, short-duration variations in process parameters — variations that get averaged away in 5-second historian samples.
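As a rough illustration of option 1, the sketch below polls holding registers from a Modbus TCP PLC at a sub-second interval using the pymodbus library. The host address, slave id, register offsets, and 200 ms poll period are assumptions for illustration, not values from any particular plant.

```python
# Minimal sketch of option 1: polling raw values from a Modbus TCP PLC at the edge.
# Assumes pymodbus 3.x; host, slave id, register offsets, and the 200 ms poll
# interval are illustrative placeholders.
import time
from datetime import datetime, timezone

from pymodbus.client import ModbusTcpClient

PLC_HOST = "192.168.10.15"          # hypothetical PLC on the OT network
REGISTER_MAP = {                    # hypothetical tag -> holding-register offset
    "spindle_rpm": 0,
    "motor_temp_c": 1,
}

client = ModbusTcpClient(PLC_HOST, port=502)
client.connect()

try:
    while True:
        result = client.read_holding_registers(0, count=2, slave=1)
        if not result.isError():
            # Stamp readings with the edge device's UTC clock, not the PLC's.
            ts = datetime.now(timezone.utc).isoformat()
            sample = {tag: result.registers[offset] for tag, offset in REGISTER_MAP.items()}
            print(ts, sample)       # in practice, hand off to the preprocessing layer
        time.sleep(0.2)             # ~200 ms sampling, finer than typical historian resolution
finally:
    client.close()
```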

Layer 2: Edge compute and preprocessing. Raw PLC data is useful but noisy. Before transmitting data upstream, the edge layer should handle: timestamp normalization (PLCs often have poor clock synchronization — edge devices need to apply consistent UTC timestamps), dead-band filtering (only transmit values that have changed by more than a configured threshold, to avoid flooding the pipeline with static readings), and data buffering (store-and-forward capability for when connectivity is unavailable).
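A minimal sketch of those three responsibilities in one place; the dead-band thresholds and buffer size are illustrative assumptions.

```python
# Sketch of edge-side preprocessing: UTC timestamping, dead-band filtering,
# and store-and-forward buffering. Thresholds and buffer size are illustrative.
from collections import deque
from datetime import datetime, timezone

DEAD_BAND = {"motor_temp_c": 0.5, "spindle_rpm": 5.0}   # per-tag change thresholds
buffer = deque(maxlen=100_000)                           # bounded store-and-forward queue
last_sent = {}                                           # last transmitted value per tag

def preprocess(tag: str, value: float) -> None:
    """Stamp, filter, and buffer a single reading from the acquisition layer."""
    # 1. Timestamp normalization: trust the edge device's NTP-synced clock, not the PLC's.
    ts = datetime.now(timezone.utc).isoformat()

    # 2. Dead-band filtering: drop readings that haven't moved by more than the threshold.
    previous = last_sent.get(tag)
    if previous is not None and abs(value - previous) < DEAD_BAND.get(tag, 0.0):
        return
    last_sent[tag] = value

    # 3. Buffering: queue locally; a separate loop drains this queue when the uplink is healthy.
    buffer.append({"tag": tag, "value": value, "ts": ts})
```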

Edge compute also handles protocol bridging — translating from the native industrial protocol to a cloud-friendly transport protocol. MQTT is the near-universal choice for this layer: it's lightweight, supports quality-of-service guarantees for message delivery, and has mature client libraries for every language. HTTPS REST is simpler but less efficient for high-frequency streaming.
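A hedged sketch of the MQTT side of that bridge, assuming the paho-mqtt 2.x client; the broker hostname and topic scheme are illustrative.

```python
# Sketch of protocol bridging: publish normalized samples to MQTT with QoS 1 delivery.
# Assumes paho-mqtt 2.x; broker address and topic naming are illustrative.
import json

import paho.mqtt.client as mqtt

BROKER = "mqtt.example-cloud.com"            # hypothetical cloud broker
TOPIC = "plant-7/line-3/telemetry"           # hypothetical topic scheme

client = mqtt.Client(mqtt.CallbackAPIVersion.VERSION2)
client.connect(BROKER, port=1883)            # plaintext for brevity; see Layer 3 for TLS
client.loop_start()                          # background network loop with reconnect handling

def publish_sample(sample: dict) -> None:
    # QoS 1 gives at-least-once delivery, so the cloud side must tolerate duplicates.
    client.publish(TOPIC, json.dumps(sample), qos=1)
```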

Layer 3: Secure transport. Getting data from the OT (operational technology) network to the cloud requires crossing the IT/OT boundary — one of the most sensitive seams in industrial network architecture. The two main approaches are: (1) a dedicated outbound MQTT/HTTPS connection through the corporate firewall (requires IT approval and firewall rule changes), or (2) a cellular-connected edge device that bypasses the plant network entirely.

Cellular is often faster to deploy because it avoids the IT approval process, but it has implications for data governance and cost. For plants processing high data volumes (multiple PLCs sampled at 100 ms), cellular data costs can be significant; for moderate volumes, it is usually the fastest path to live data.

Security requirements: all transport should be encrypted (TLS 1.2+), all connections should be authenticated (mutual TLS or token-based), and the connection should be outbound-only from the plant network. No inbound connections should be opened to the OT network.
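Sticking with the paho-mqtt client from the previous sketch, those requirements translate roughly into the connection setup below. The certificate paths and broker address are assumptions for illustration.

```python
# Sketch of the Layer 3 security posture for the MQTT publisher:
# encrypted transport, mutual-TLS authentication, outbound-only connection from the edge.
# Assumes paho-mqtt 2.x; certificate paths and broker address are illustrative.
import ssl

import paho.mqtt.client as mqtt

client = mqtt.Client(mqtt.CallbackAPIVersion.VERSION2)
client.tls_set(
    ca_certs="/etc/edge-gateway/ca.pem",       # hypothetical CA bundle used to verify the broker
    certfile="/etc/edge-gateway/device.pem",   # per-device client certificate (mutual TLS)
    keyfile="/etc/edge-gateway/device.key",
    tls_version=ssl.PROTOCOL_TLS_CLIENT,       # client-side TLS; broker policy should enforce 1.2+
)

# The edge device dials out to the broker; no inbound listener is opened on the OT network.
client.connect("mqtt.example-cloud.com", port=8883)
client.loop_start()
```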

Layer 4: Cloud ingestion. Data arriving at the cloud side needs to be ingested at scale with low latency. The two main patterns are stream processing (Apache Kafka, AWS Kinesis, Azure Event Hubs) for high-volume real-time workloads, and time-series databases (InfluxDB, TimescaleDB, AWS Timestream) for storage and query. A production pipeline typically uses both: stream processing for real-time alerting and AI inference, time-series database for historical analytics and model training.
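As a rough sketch of how those two pieces fit together, the consumer below reads telemetry from a Kafka topic and lands it in a time-series table for historical analytics. The topic name, table schema, and connection settings are all assumptions.

```python
# Sketch of Layer 4: consume telemetry from a Kafka topic and persist it to a
# time-series table (e.g. TimescaleDB). Assumes confluent-kafka and psycopg2;
# topic, table, and connection settings are illustrative.
import json

import psycopg2
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "kafka.example-cloud.com:9092",
    "group.id": "telemetry-writer",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["plant-telemetry"])

conn = psycopg2.connect("dbname=telemetry user=ingest host=tsdb.example-cloud.com")

while True:
    msg = consumer.poll(timeout=1.0)
    if msg is None or msg.error():
        continue
    sample = json.loads(msg.value())
    with conn, conn.cursor() as cur:          # commits the transaction on exit
        cur.execute(
            "INSERT INTO readings (ts, tag, value) VALUES (%s, %s, %s)",
            (sample["ts"], sample["tag"], sample["value"]),
        )
```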

Layer 5: AI model deployment. With clean, normalized, real-time data available in the cloud tier, AI models can be deployed for monitoring and inference. The specific architecture depends on latency requirements: models that need to respond in under a second (real-time quality inspection, safety monitoring) need to run at the edge. Models with more relaxed latency requirements (failure probability scoring, optimization recommendations) can run in the cloud.

The Pitfalls That Catch Teams Off Guard

Clock synchronization. PLCs don't have accurate system clocks. They drift. Two PLCs on the same production line may report timestamps that are 30 seconds apart for events that happened simultaneously. Without correcting for this, any analysis that correlates data from multiple sources is working with corrupted timestamps — and the corruption is silent. Edge devices should apply NTP-synchronized UTC timestamps at the point of collection, not trust the PLC's internal clock.

Network brownouts. Plant networks, especially in older facilities, are not designed for continuous high-frequency data transmission. Periodic network brownouts — brief periods of packet loss or high latency — can corrupt streaming pipelines in subtle ways. Store-and-forward buffering at the edge, combined with idempotent write handling on the cloud side, is essential for data integrity.
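One way to get that idempotency, assuming the readings table from the ingestion sketch carries a UNIQUE constraint on (tag, ts): let the database silently drop replayed rows.

```python
# Sketch of idempotent write handling: messages replayed from a store-and-forward
# buffer are deduplicated by the database rather than double-counted.
# Table and column names are illustrative and assume a UNIQUE constraint on (tag, ts).
import psycopg2

conn = psycopg2.connect("dbname=telemetry user=ingest host=tsdb.example-cloud.com")

def write_sample(sample: dict) -> None:
    with conn, conn.cursor() as cur:
        cur.execute(
            """
            INSERT INTO readings (ts, tag, value)
            VALUES (%(ts)s, %(tag)s, %(value)s)
            ON CONFLICT (tag, ts) DO NOTHING
            """,
            sample,
        )
```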

Schema drift. When a PLC program is updated — a common occurrence as production processes change — the register map may change. If your data pipeline is reading specific register addresses and those addresses are remapped in a PLC update, your data is now silently wrong. Robust pipelines have monitoring for unexpected value changes that could indicate schema drift.
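A lightweight form of that monitoring is a per-tag plausibility check on incoming values; the expected ranges below are placeholders.

```python
# Sketch of a schema-drift guard: flag tags whose values fall outside the range
# seen historically, which can indicate a remapped register after a PLC program update.
# The expected ranges are illustrative placeholders.
EXPECTED_RANGES = {
    "motor_temp_c": (10.0, 120.0),
    "spindle_rpm": (0.0, 6000.0),
}

def check_drift(tag: str, value: float) -> bool:
    """Return True if the reading looks implausible for this tag."""
    low, high = EXPECTED_RANGES.get(tag, (float("-inf"), float("inf")))
    if not low <= value <= high:
        # In production this would raise an alert rather than print.
        print(f"possible schema drift: {tag}={value} outside [{low}, {high}]")
        return True
    return False
```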

IT/OT culture conflicts. The biggest non-technical pitfall. IT teams are focused on security and change management; OT teams are focused on uptime and production continuity. These priorities conflict in predictable ways when you're trying to add new connections to the production network. Getting both teams aligned early — before hardware is installed, not after — is the single most important factor in deployment timeline.

What This Looks Like With SensorBridge

The architecture described above is exactly what our SensorBridge module implements. The edge gateway handles native protocol connections and timestamp normalization. MQTT transport over cellular or corporate Ethernet handles secure data movement. Cloud ingestion handles normalization and routing to AI model pipelines. The configuration interface exposes the variables you care about without requiring you to know the underlying PLC register addresses.

For manufacturers who want to understand the architecture more deeply, we're happy to walk through a detailed technical review. For those who want to skip the complexity and get to AI insights, SensorBridge handles the infrastructure. Either approach is valid depending on how much control you want over the data layer.

See SensorBridge™ in Action

SensorBridge connects to your PLCs, sensors, and SCADA systems and delivers clean, normalized data to the Intuigence AI platform in days, not months.

Request a Demo