Back to Blog
9 min read

Building a real-time analytics pipeline: architecture decisions that matter at scale

A client needed to move from daily batch reporting to live operational dashboards — processing 1.2 million events per day with an end-to-end latency target of under 200 milliseconds. This is the architecture we built, and the decisions we would make differently.

Real-time analytics is a loaded term. It covers everything from "refresh every 15 minutes" to "events reflected in dashboards within 100ms of occurrence." The architecture choices that serve these requirements are radically different, and conflating them is one of the primary sources of over-engineering in data platform projects.

For this engagement, the client had a genuine operational requirement: their operations team needed to see fulfilment events reflected in dashboards quickly enough to intervene on exceptions before they became customer-facing failures. The 200ms target was not aspirational — it was the time budget within which a supervisor could realistically act on an alert.

1.2M
events per day at peak
<200ms
end-to-end pipeline latency
$0.18
cost per million events (steady state)

Requirements that shaped every decision

Before selecting any technology, we documented the specific requirements that would determine architectural fitness:

  • Peak throughput: ~14 events per second average, spikes to ~120 events/second during shift changes
  • Event payload size: 0.8–3.2KB JSON
  • End-to-end latency: p95 under 200ms, p99 under 500ms
  • Data retention: 90 days of raw events, 2 years of aggregated metrics
  • Exactly-once semantics required for financial event types
  • At-least-once acceptable for operational events
  • Downstream consumers: real-time dashboard, ML feature store, compliance archive

These requirements ruled out several tempting architectural simplifications. A serverless event-driven approach (API Gateway → Lambda → DynamoDB Streams) would have hit cold-start latency problems under the p99 requirement. A managed service like Kinesis Data Firehose would have breached the 200ms target due to its minimum 60-second buffer.

Ingestion: Kafka over Kinesis

The first major decision was the ingestion layer. The serious candidates were Apache Kafka (self-managed on EKS), Amazon MSK (managed Kafka), and Amazon Kinesis Data Streams.

We chose Amazon MSK for the following reasons:

  • Latency — Kafka's end-to-end producer-to-consumer latency under realistic load is consistently 5–15ms. Kinesis's is 70–200ms. For a 200ms total budget, this headroom matters.
  • Exactly-once semantics — Kafka's transactional producer API provides genuine exactly-once delivery semantics for the financial event types. Kinesis requires implementing deduplication in the consumer layer.
  • Retention flexibility — Kafka topics can be configured with arbitrary retention periods. We used 7 days of raw retention as a replay buffer, which Kinesis supports only at significant additional cost.
  • Ecosystem — Kafka's connector ecosystem (Kafka Connect) gave us pre-built, production-tested sinks to our target stores, reducing custom connector code.

MSK rather than self-managed Kafka eliminated operational overhead at the cost of some configuration flexibility — a trade-off worth making at this scale.

Stream processing: Apache Flink for stateful operations

The pipeline required stateful aggregations — rolling windows, session groupings, and anomaly detection that needed memory of previous events. This ruled out stateless Lambda-based processing.

We deployed Apache Flink on Amazon EMR Serverless. Flink's state backends (RocksDB in this case) allowed us to maintain large in-memory aggregation state without externalising it to a database on every event — the key to hitting the latency target. The Flink jobs consumed from Kafka, applied the windowing and enrichment logic, and produced to two output topics: one for the real-time dashboard consumer, one for the storage layer.

Why not Spark Streaming?

Spark Streaming's micro-batch architecture introduces a minimum latency of one batch interval — typically 500ms to a few seconds. For our sub-200ms requirement, true record-at-a-time processing was required, which is Flink's native execution model.

Storage: dual-write to lake and warehouse

We dual-wrote processed events to two stores with different purposes:

  • S3 + Apache Iceberg (data lake) — Raw and processed events in Parquet format, partitioned by hour. Iceberg's time-travel queries and schema evolution support were essential for the compliance archive requirement. Cost: approximately $0.023 per GB per month.
  • Amazon Redshift (data warehouse) — Aggregated metrics tables updated via Redshift Streaming Ingestion from the Kafka output topics. The dashboard queries run against Redshift materialised views, not the raw event store. This separation is critical for performance: the dashboard queries are pre-computed and served from memory, not scanned from S3.

Dashboard delivery: materialised views plus WebSocket push

The final latency challenge was getting changes from Redshift into the dashboard. Polling — even at 1-second intervals — would add unacceptable latency and unnecessary query load. We built a thin notification service that listened to the Flink output topic and pushed invalidation messages to connected dashboard clients via WebSocket. The client then fetched the updated materialised view on receipt of the push notification.

This pattern kept the dashboard responsive under the 200ms target while keeping the Redshift query layer simple and cacheable.

What we would do differently

Three things we would reconsider on a future engagement:

  1. Schema registry from day one. We added Confluent Schema Registry after initial deployment when schema drift between producers became a maintenance problem. It should have been a foundational requirement from the start.
  2. Separate operational and analytical pipelines earlier. We initially routed compliance archive events through the same Flink topology as operational events. The different retention and processing requirements eventually forced a split. We should have separated them from the beginning.
  3. Consider ClickHouse for the warehouse layer. Amazon Redshift served the requirements, but ClickHouse's columnar engine and materialised view refresh speed would have simplified the notification service at the cost of a self-managed deployment. Worth evaluating seriously for future real-time analytics workloads.

If you are designing a real-time analytics platform and want a review of your architecture, our data engineering team offers architecture review engagements that typically take two to three days.

More from the blog

AI & Machine LearningMay 22, 2026

Why most AI projects fail — and what the successful ones do differently

After 40+ AI implementations, the patterns that separate projects that deliver ROI from those that stall at the POC stage.

Read more
Cloud InfrastructureMay 15, 2026

Kubernetes cost optimisation: five patterns that cut our clients' bills by 40%

Container orchestration is powerful and easy to over-provision. Five techniques we apply to every K8s deployment.

Read more
Digital TransformationApril 14, 2026

Digital transformation without the buzzwords: a practical framework

Most transformation programmes fail because they start with technology instead of outcomes. The framework we use with every client.

Read more
WhatsApp