Real-time analytics is a loaded term. It covers everything from "refresh every 15 minutes" to "events reflected in dashboards within 100ms of occurrence." The architecture choices that serve these requirements are radically different, and conflating them is one of the primary sources of over-engineering in data platform projects.
For this engagement, the client had a genuine operational requirement: their operations team needed to see fulfilment events reflected in dashboards quickly enough to intervene on exceptions before they became customer-facing failures. The 200ms target was not aspirational — it was the time budget within which a supervisor could realistically act on an alert.
Requirements that shaped every decision
Before selecting any technology, we documented the specific requirements that would determine architectural fitness:
- Peak throughput: ~14 events per second average, spikes to ~120 events/second during shift changes
- Event payload size: 0.8–3.2KB JSON
- End-to-end latency: p95 under 200ms, p99 under 500ms
- Data retention: 90 days of raw events, 2 years of aggregated metrics
- Exactly-once semantics required for financial event types
- At-least-once acceptable for operational events
- Downstream consumers: real-time dashboard, ML feature store, compliance archive
These requirements ruled out several tempting architectural simplifications. A serverless event-driven approach (API Gateway → Lambda → DynamoDB Streams) would have hit cold-start latency problems under the p99 requirement. A managed service like Kinesis Data Firehose would have breached the 200ms target due to its minimum 60-second buffer.
Ingestion: Kafka over Kinesis
The first major decision was the ingestion layer. The serious candidates were Apache Kafka (self-managed on EKS), Amazon MSK (managed Kafka), and Amazon Kinesis Data Streams.
We chose Amazon MSK for the following reasons:
- Latency — Kafka's end-to-end producer-to-consumer latency under realistic load is consistently 5–15ms. Kinesis's is 70–200ms. For a 200ms total budget, this headroom matters.
- Exactly-once semantics — Kafka's transactional producer API provides genuine exactly-once delivery semantics for the financial event types. Kinesis requires implementing deduplication in the consumer layer.
- Retention flexibility — Kafka topics can be configured with arbitrary retention periods. We used 7 days of raw retention as a replay buffer, which Kinesis supports only at significant additional cost.
- Ecosystem — Kafka's connector ecosystem (Kafka Connect) gave us pre-built, production-tested sinks to our target stores, reducing custom connector code.
MSK rather than self-managed Kafka eliminated operational overhead at the cost of some configuration flexibility — a trade-off worth making at this scale.
Stream processing: Apache Flink for stateful operations
The pipeline required stateful aggregations — rolling windows, session groupings, and anomaly detection that needed memory of previous events. This ruled out stateless Lambda-based processing.
We deployed Apache Flink on Amazon EMR Serverless. Flink's state backends (RocksDB in this case) allowed us to maintain large in-memory aggregation state without externalising it to a database on every event — the key to hitting the latency target. The Flink jobs consumed from Kafka, applied the windowing and enrichment logic, and produced to two output topics: one for the real-time dashboard consumer, one for the storage layer.
Spark Streaming's micro-batch architecture introduces a minimum latency of one batch interval — typically 500ms to a few seconds. For our sub-200ms requirement, true record-at-a-time processing was required, which is Flink's native execution model.
Storage: dual-write to lake and warehouse
We dual-wrote processed events to two stores with different purposes:
- S3 + Apache Iceberg (data lake) — Raw and processed events in Parquet format, partitioned by hour. Iceberg's time-travel queries and schema evolution support were essential for the compliance archive requirement. Cost: approximately $0.023 per GB per month.
- Amazon Redshift (data warehouse) — Aggregated metrics tables updated via Redshift Streaming Ingestion from the Kafka output topics. The dashboard queries run against Redshift materialised views, not the raw event store. This separation is critical for performance: the dashboard queries are pre-computed and served from memory, not scanned from S3.
Dashboard delivery: materialised views plus WebSocket push
The final latency challenge was getting changes from Redshift into the dashboard. Polling — even at 1-second intervals — would add unacceptable latency and unnecessary query load. We built a thin notification service that listened to the Flink output topic and pushed invalidation messages to connected dashboard clients via WebSocket. The client then fetched the updated materialised view on receipt of the push notification.
This pattern kept the dashboard responsive under the 200ms target while keeping the Redshift query layer simple and cacheable.
What we would do differently
Three things we would reconsider on a future engagement:
- Schema registry from day one. We added Confluent Schema Registry after initial deployment when schema drift between producers became a maintenance problem. It should have been a foundational requirement from the start.
- Separate operational and analytical pipelines earlier. We initially routed compliance archive events through the same Flink topology as operational events. The different retention and processing requirements eventually forced a split. We should have separated them from the beginning.
- Consider ClickHouse for the warehouse layer. Amazon Redshift served the requirements, but ClickHouse's columnar engine and materialised view refresh speed would have simplified the notification service at the cost of a self-managed deployment. Worth evaluating seriously for future real-time analytics workloads.
If you are designing a real-time analytics platform and want a review of your architecture, our data engineering team offers architecture review engagements that typically take two to three days.