TL;DR:

  • IoT data pipelines have four layers: ingestion (MQTT/Kafka), processing (stream or batch), storage (time-series DB), and visualisation
  • Time-series databases like InfluxDB and TimescaleDB outperform general-purpose databases for sensor data by 10–100x on write throughput and query speed
  • Stream processing with Kafka Streams or Apache Flink handles real-time anomaly detection; batch processing with Spark handles historical analysis

IoT data pipeline architecture is where most projects accumulate technical debt. A Raspberry Pi writing sensor readings directly to a Postgres table works for a proof of concept but falls apart at 1,000 devices and 10,000 readings per minute. Designing the pipeline correctly from the start — separating ingestion, processing, storage, and visualisation into distinct layers — makes the difference between a system that scales and one that requires an emergency rewrite at 10x load.

Layer 1: Ingestion

The ingestion layer receives raw data from devices and makes it available to downstream processors. Two protocols dominate.

MQTT is the standard for constrained IoT devices — it’s lightweight, handles unreliable connections, and every platform has a client library. Devices publish to topic hierarchies like factory/line-1/sensor/42/temperature. A broker (Mosquitto for small deployments, EMQX for production scale) receives and routes messages.
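To make the topic hierarchy concrete, here is a minimal sketch of how a subscription filter is matched against a published topic. It follows the standard MQTT wildcard rules (`+` for one level, `#` for all remaining levels). The function name `topic_matches` is hypothetical, and the sketch ignores the special handling of topics that start with `$`.

```python
def topic_matches(topic_filter: str, topic: str) -> bool:
    """Return True if an MQTT topic matches a subscription filter.

    '+' matches exactly one level; '#' matches all remaining levels
    (including the parent level itself) and must be the last filter level.
    """
    filter_levels = topic_filter.split("/")
    topic_levels = topic.split("/")

    for i, level in enumerate(filter_levels):
        if level == "#":
            # Multi-level wildcard is only valid as the final level.
            return i == len(filter_levels) - 1
        if i >= len(topic_levels):
            return False
        if level != "+" and level != topic_levels[i]:
            return False
    # No '#' was seen, so both sides must have the same depth.
    return len(filter_levels) == len(topic_levels)


topic = "factory/line-1/sensor/42/temperature"
print(topic_matches("factory/+/sensor/+/temperature", topic))  # True
print(topic_matches("factory/line-2/#", topic))                # False
```

A stream processor that wants every temperature reading would subscribe with the first filter, while a per-line dashboard could subscribe to `factory/line-1/#`.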

Apache Kafka is the standard for the edge-to-cloud boundary and high-throughput pipelines. Where MQTT is optimised for device-to-broker communication, Kafka is optimised for broker-to-processor communication. A common pattern is an MQTT-to-Kafka bridge: an EMQX rule engine or a Node.js bridge process subscribes to MQTT topics and publishes to Kafka topics. Kafka’s durable log retention means downstream consumers can replay data, recover from failures, and operate independently.
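The core of such a bridge is the mapping from an MQTT message to a Kafka record. The sketch below shows only that mapping and leaves out the client connections. The topic layout `<site>/<line>/sensor/<sensor_id>/<metric>`, the `iot.<site>.<metric>` naming scheme, the `KafkaRecord` type, and the plain numeric payload are all assumptions made for illustration.

```python
import json
from typing import NamedTuple


class KafkaRecord(NamedTuple):
    topic: str
    key: bytes
    value: bytes


def to_kafka_record(mqtt_topic: str, payload: bytes) -> KafkaRecord:
    """Map one MQTT message to a Kafka record (hypothetical convention)."""
    # Assumed layout: <site>/<line>/sensor/<sensor_id>/<metric>.
    # Raises ValueError if the topic does not have exactly five levels.
    site, line, _, sensor_id, metric = mqtt_topic.split("/")
    envelope = {
        "site": site,
        "line": line,
        "sensor_id": sensor_id,
        "metric": metric,
        # Assumes the device publishes a plain numeric string such as b"21.5".
        "value": float(payload.decode("utf-8")),
    }
    return KafkaRecord(
        # Kafka topic names cannot contain '/', so levels are joined with '.'.
        topic=f"iot.{site}.{metric}",
        # Keying by device sends all of a device's readings to one partition.
        key=f"{line}/{sensor_id}".encode("utf-8"),
        value=json.dumps(envelope).encode("utf-8"),
    )


record = to_kafka_record("factory/line-1/sensor/42/temperature", b"21.5")
print(record.topic, record.key)
```

Keying by device is a design choice: with Kafka's default partitioner, records that share a key land in the same partition, which preserves per-device ordering for downstream consumers.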

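For lower-volume deployments (under 10,000 messages per second), MQTT alone is sufficient: the broker delivers to multiple subscribers, including stream processors and database writers. At higher volumes, or when consumers need to replay history and recover independently from a durable log, add Kafka. MQTT QoS 1 already gives at-least-once delivery to subscribers, but the broker does not keep a replayable log of past messages.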

Layer 2: Processing

Processing transforms raw device data into meaningful signals. Two models apply depending on latency requirements.

Stream processing operates on data as it arrives — sub-second latency for real-time alerting, feature computation, and anomaly detection.

  • Kafka Streams (Java/Kotlin) is a library embedded in your application, so no separate cluster is needed. It is good for stateful aggregations (sliding window averages, count-based triggers) within a Kafka topology.
  • Apache Flink runs as a separate cluster for complex event processing. It handles out-of-order events with watermarks, supports exactly-once semantics, and scales to millions of events per second.
  • ksqlDB (formerly KSQL, from Confluent) offers SQL-like stream processing on Kafka topics without writing Java, which suits teams more comfortable with SQL than with stream processing concepts.

A common edge stream processing pattern: compute a 60-second rolling average of temperature; emit an alert event if the value exceeds a threshold. The alert event goes to a separate Kafka topic triggering a notification workflow. The averaged value goes to the time-series database.
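The pattern above can be sketched in plain Python, independent of any stream processing framework. The class name `RollingAverageAlert` and the 80-degree threshold are hypothetical, and the sketch assumes readings arrive in timestamp order (a real Flink or Kafka Streams job would also handle late events).

```python
from collections import deque
from typing import Deque, Tuple


class RollingAverageAlert:
    """Time-based rolling average with a threshold check."""

    def __init__(self, window_seconds: float = 60.0, threshold: float = 80.0) -> None:
        self.window_seconds = window_seconds
        self.threshold = threshold
        self._readings: Deque[Tuple[float, float]] = deque()  # (timestamp, value)

    def update(self, timestamp: float, value: float) -> Tuple[float, bool]:
        """Add a reading; return (rolling average, alert flag)."""
        self._readings.append((timestamp, value))
        # Evict readings that have fallen out of the window.
        cutoff = timestamp - self.window_seconds
        while self._readings[0][0] <= cutoff:
            self._readings.popleft()
        average = sum(v for _, v in self._readings) / len(self._readings)
        return average, average > self.threshold


detector = RollingAverageAlert(window_seconds=60.0, threshold=80.0)
for ts, temp in [(0.0, 70.0), (30.0, 90.0), (61.0, 100.0)]:
    avg, alert = detector.update(ts, temp)
    # The average would go to the time-series database,
    # the alert to a separate topic.
    print(ts, avg, alert)
```

In a deployment, the averaged value and the alert flag would be published to different destinations, as described above. Here they are simply returned together.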

Batch processing operates on historical data — hourly or daily jobs for trend analysis, model retraining data preparation, and compliance reporting. Apache Spark or dbt on a data warehouse handles this tier. At the edge, batch jobs often run as scheduled Kubernetes CronJobs or systemd timers.

Layer 3: Storage

General-purpose relational databases fail at scale for time-series data. Standard B-tree indexes don’t handle append-heavy, time-ordered workloads efficiently, and table rows with many columns waste storage on sparse sensor data.

Time-series databases are purpose-built for this pattern:

| Database | Architecture | Best For | Write Throughput |
|---|---|---|---|
| InfluxDB | Columnar, custom TSM engine | IoT telemetry, metrics, events | ~500K points/sec per node |
| TimescaleDB | PostgreSQL extension | SQL-familiar teams, relational + time-series | ~300K rows/sec |
| QuestDB | SIMD-accelerated columnar | High-frequency data, analytics queries | ~1M rows/sec |
| OpenTSDB | HBase-backed | Very large scale, existing HBase infra | High, with HBase cluster |

InfluxDB 3.x (rewritten in Rust on Apache Arrow and Parquet) is a leading choice for pure IoT telemetry. The line protocol is simple to write and the SQL query interface handles downsampling queries well. Note that the feature set differs by version: the TSM storage engine and continuous queries belong to InfluxDB 1.x, and tasks to 2.x, so check which rollup mechanism your release offers. Whatever the mechanism, automate data rollups: keep full-resolution data for 30 days, 1-minute aggregates for 1 year, and hourly aggregates indefinitely.
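To show what line protocol looks like, here is a minimal formatter sketch. The function name `to_line_protocol` is hypothetical, and the escaping covers only the common cases (commas, spaces, and equals signs in names; quotes and backslashes in string fields). In practice an official client library would usually do this for you.

```python
from typing import Dict, Union

FieldValue = Union[bool, int, float, str]


def _escape(text: str, chars: str) -> str:
    """Backslash-escape each character in `chars` within `text`."""
    for ch in chars:
        text = text.replace(ch, "\\" + ch)
    return text


def _format_field(value: FieldValue) -> str:
    if isinstance(value, bool):  # Check bool first: bool is a subclass of int.
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"  # Integers carry an 'i' suffix.
    if isinstance(value, float):
        return repr(value)
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def to_line_protocol(
    measurement: str,
    tags: Dict[str, str],
    fields: Dict[str, FieldValue],
    timestamp_ns: int,
) -> str:
    """Build one line: measurement,tag=value field=value timestamp."""
    if not fields:
        raise ValueError("line protocol requires at least one field")
    # Tags are sorted by key, which the InfluxDB docs recommend.
    tag_part = "".join(
        f",{_escape(k, ', =')}={_escape(v, ', =')}" for k, v in sorted(tags.items())
    )
    field_part = ",".join(
        f"{_escape(k, ', =')}={_format_field(v)}" for k, v in fields.items()
    )
    return f"{_escape(measurement, ', ')}{tag_part} {field_part} {timestamp_ns}"


print(to_line_protocol(
    "temperature",
    {"sensor": "42", "line": "line-1"},
    {"value": 21.5},
    1700000000000000000,
))
# temperature,line=line-1,sensor=42 value=21.5 1700000000000000000
```

Tags are indexed metadata (which sensor, which line) and fields hold the measured values, so device identifiers belong in tags and readings in fields.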

TimescaleDB is the right choice when you’re already on PostgreSQL and want time-series performance without a new database to operate. The time_bucket() function and hypertable auto-partitioning can give 10–100x better performance than plain PostgreSQL tables for time-series queries, depending on data volume and query shape. If your team already knows Postgres, the operational overhead is minimal.
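For readers unfamiliar with time_bucket(), the sketch below mimics in plain Python the grouping that a query such as `SELECT time_bucket('1 minute', time), avg(value) ... GROUP BY 1` performs: each timestamp is floored to the start of its bucket and the values in each bucket are averaged. The function name `bucket_average` and the epoch-second timestamps are assumptions for illustration. This is not how TimescaleDB implements it internally.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def bucket_average(
    readings: Iterable[Tuple[int, float]], bucket_seconds: int
) -> List[Tuple[int, float]]:
    """Average (epoch_seconds, value) readings per fixed-width time bucket."""
    buckets: Dict[int, List[float]] = defaultdict(list)
    for epoch_seconds, value in readings:
        # Floor the timestamp to the start of its bucket.
        bucket_start = epoch_seconds - (epoch_seconds % bucket_seconds)
        buckets[bucket_start].append(value)
    return [
        (start, sum(values) / len(values))
        for start, values in sorted(buckets.items())
    ]


readings = [(0, 10.0), (30, 20.0), (60, 30.0), (119, 50.0)]
print(bucket_average(readings, 60))  # [(0, 15.0), (60, 40.0)]
```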

Data retention policies are essential from day one. A factory with 1,000 sensors at 1Hz generates 86.4 million rows per day. Define tiered retention: raw data for 7–30 days, downsampled for 1 year, aggregates indefinitely.
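The arithmetic behind that figure, and the effect of tiering, can be checked with a few lines. The function name `rows_per_day` is hypothetical; the sensor count, sample rate, and retention periods come from the text, using the 30-day upper bound for raw data.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def rows_per_day(sensors: int, readings_per_second: float) -> int:
    """Rows written per day for a fleet sampling at a fixed rate."""
    return round(sensors * readings_per_second * SECONDS_PER_DAY)


raw_daily = rows_per_day(1_000, 1.0)        # full resolution, 1 Hz
minute_daily = rows_per_day(1_000, 1 / 60)  # 1-minute aggregates

print(raw_daily)     # 86400000
print(minute_daily)  # 1440000

# One year of raw data versus 30 days raw plus one year of 1-minute aggregates.
untiered_year = raw_daily * 365
tiered_year = raw_daily * 30 + minute_daily * 365
print(untiered_year)  # 31536000000
print(tiered_year)    # 3117600000
```

Under these assumptions, tiering cuts the rows held after one year by roughly a factor of ten, before counting the much smaller hourly aggregates.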

Layer 4: Visualisation and Cloud Integration

Grafana is the default visualisation layer for IoT pipelines: it has a native InfluxDB data source, reads TimescaleDB through its PostgreSQL data source, supports real-time streaming dashboards, and can be self-hosted at the edge for air-gapped environments. Most UK engineering teams working in this space have landed on Grafana, and the community around it is excellent.

Cloud integration patterns:

  • Scheduled jobs can export aggregated data to S3, Azure Blob, or GCS in Parquet format, with cloud data warehouses querying it for historical analysis.
  • InfluxDB’s replication or TimescaleDB’s logical replication pushes data to the cloud in near-real-time.
  • The edge stream processor can forward only anomalies and summaries to cloud IoT platforms (AWS IoT Core, Azure IoT Hub), keeping bandwidth costs bounded.
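The last pattern, forwarding only anomalies and summaries, might look like the sketch below. The function name `messages_for_cloud`, the message shapes, and the fixed threshold are all hypothetical. A real deployment would publish these messages through the cloud platform's SDK rather than return them.

```python
from statistics import mean
from typing import Dict, List, Union

Message = Dict[str, Union[str, int, float]]


def messages_for_cloud(
    sensor_id: str, values: List[float], threshold: float
) -> List[Message]:
    """Reduce one window of readings to a summary plus any anomalies."""
    if not values:
        return []
    messages: List[Message] = [{
        "type": "summary",
        "sensor_id": sensor_id,
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
    }]
    # Only readings above the threshold are forwarded individually.
    messages.extend(
        {"type": "anomaly", "sensor_id": sensor_id, "value": v}
        for v in values
        if v > threshold
    )
    return messages


window = [20.0, 21.0, 95.0, 22.0]
for message in messages_for_cloud("42", window, threshold=80.0):
    print(message)
```

Four readings become two messages here. At 1 Hz with a 60-second window, a quiet sensor would send one summary per minute instead of sixty raw readings.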

The Bottom Line

Build the pipeline in layers from the start: MQTT for device ingestion, Kafka for durability and fan-out, a stream processor for real-time alerting, a time-series database for storage, and Grafana for visualisation. The most common mistake is writing directly from MQTT to a relational database — it works until it doesn’t, at exactly the moment your deployment is large enough that rewriting is painful.