Batch Processing vs Stream Processing — Key Differences Explained
Should you process data in large batches every hour, or process each event the moment it arrives? This architectural decision shapes your system's latency, complexity, and cost. This guide explains both approaches with real-world use cases, tools, and a decision framework so you can confidently choose the right architecture for your data pipeline.
- Hours — typical batch latency
- Milliseconds — typical stream latency
- Lambda — architecture combining both
- 3 tools — Spark, Flink, Kafka Streams
The Core Difference
The fundamental question is: when does data get processed relative to when it arrives? This single distinction cascades into differences in latency, system design, tooling, cost, and operational complexity. Understanding it clearly is the foundation for all data architecture decisions.
| Dimension | Batch Processing | Stream Processing |
|---|---|---|
| When data is processed | After accumulating a large dataset | As each event/record arrives |
| Latency | Minutes to hours | Milliseconds to seconds |
| Data model | Bounded dataset (finite) | Unbounded stream (continuous) |
| Typical schedule | Scheduled: nightly, hourly, weekly | Continuous: 24/7 |
| Error handling | Retry failed jobs, reprocess batch | Handle failures per event |
| Infrastructure cost | Lower — machines idle between runs | Higher — always running |
| Complexity | Lower — simpler mental model | Higher — stateful, time windows, watermarks |
| Debugging | Easy — rerun with same data | Harder — ephemeral data, replay needed |
| Best for | Reports, ETL, ML training | Fraud detection, live dashboards, alerts |
Batch Processing — How It Works
Batch processing collects data over a period, then processes the entire accumulated dataset at once. Think of it like washing dishes — you let them pile up all day, then wash them all in one go. The delay is acceptable because the output (clean dishes, or a daily report) is only needed once, not continuously.
Data accumulates in storage
Raw events, logs, transactions, or records are written to a data lake (S3, GCS, HDFS) or database throughout the day. Nothing is processed yet — data just collects.
Scheduled job triggers
A scheduler (cron, Apache Airflow, Prefect, dbt) triggers the batch job at the designated time — nightly at 2am, hourly at :00, or on an event like file arrival.
Batch engine reads the entire dataset
Apache Spark, Hadoop MapReduce, or a SQL engine reads all accumulated records from storage. The full dataset is available for any operation — sorting, joining, aggregating across all rows.
Transforms, aggregates, computes
The engine applies transformations: filtering invalid records, joining with reference data, computing aggregates (revenue by category), training ML models, or generating report datasets.
Writes results to destination
The processed output is written to a data warehouse (BigQuery, Redshift, Snowflake), a database, or back to a data lake in a partitioned format ready for querying.
Job completes, machines idle
Once the batch job finishes, the compute cluster shuts down or idles. On cloud infrastructure, this means you pay only for the time the job runs — not 24/7.
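The lifecycle above can be sketched end to end in plain Python — a toy stand-in for a real engine like Spark, with a temp file in place of a data lake path (all names here are illustrative):

```python
import json
import tempfile
from collections import defaultdict
from pathlib import Path

# Steps 1-2: events accumulate in storage throughout the day (a temp file
# stands in for a data lake location such as an S3 prefix)
lake = Path(tempfile.mkdtemp()) / "orders.jsonl"
events = [
    {"category": "books", "total": 12.0, "status": "completed"},
    {"category": "books", "total": 8.0, "status": "cancelled"},
    {"category": "toys", "total": 30.0, "status": "completed"},
]
with lake.open("w") as f:
    for e in events:
        f.write(json.dumps(e) + "\n")

# Steps 3-5: the scheduled job reads the ENTIRE accumulated dataset at once,
# filters invalid records, and computes the aggregate output
def run_batch_job(path):
    revenue = defaultdict(float)
    with path.open() as f:
        for line in f:
            record = json.loads(line)
            if record["status"] == "completed":
                revenue[record["category"]] += record["total"]
    return dict(revenue)

print(run_batch_job(lake))  # {'books': 12.0, 'toys': 30.0}
```

The key property: the job sees a finite, complete dataset, runs to completion, and then nothing executes until the next scheduled run.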
Apache Spark
The dominant batch engine. Processes TB to PB efficiently. Python (PySpark), Scala, Java, SQL APIs. Used by Netflix, Uber, Airbnb for ETL and ML. Runs on YARN, Kubernetes, or managed services (Databricks, EMR).
Apache Hadoop MapReduce
The original big data batch framework. Largely superseded by Spark for new projects (Spark is 10-100x faster in-memory) but still present in legacy enterprise systems.
AWS Glue / Google Dataflow
Managed ETL services. Write the transformation logic, cloud handles the infrastructure provisioning and scaling. Lower operational overhead than self-managed Spark.
SQL + Scheduled Jobs
For smaller scale: a nightly SQL job in PostgreSQL, Redshift, or BigQuery is perfectly valid batch processing. dbt (data build tool) organizes and schedules SQL transformations.
```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, sum, count, round

spark = SparkSession.builder.appName("daily-revenue").getOrCreate()

# Read a day's worth of orders from the S3 data lake
df = spark.read.parquet("s3://data-lake/orders/date=2026-03-25/")

# Step 1: Filter to completed orders only
completed = df.filter(col("status") == "completed")

# Step 2: Aggregate revenue by product category
revenue_by_category = (
    completed
    .groupBy("category", "region")
    .agg(
        sum("total").alias("revenue"),
        count("*").alias("order_count"),
        round(sum("total") / count("*"), 2).alias("avg_order_value"),
    )
    .orderBy(col("revenue").desc())
)

# Step 3: Write to data warehouse (partitioned for efficient querying)
(
    revenue_by_category.write
    .mode("overwrite")
    .partitionBy("region")
    .parquet("s3://data-warehouse/daily-revenue/2026-03-25/")
)

print(f"Processed {completed.count():,} orders")
spark.stop()
```

```sql
-- dbt model: daily_revenue.sql
-- Runs nightly after raw order data lands in the warehouse
{{ config(
    materialized='incremental',
    partition_by={'field': 'order_date', 'data_type': 'date'},
    cluster_by=['region', 'category']
) }}

SELECT
    DATE(created_at) AS order_date,
    region,
    category,
    COUNT(*) AS order_count,
    SUM(total) AS revenue,
    AVG(total) AS avg_order_value,
    COUNT(DISTINCT user_id) AS unique_customers
FROM {{ source('raw', 'orders') }}
WHERE status = 'completed'
{% if is_incremental() %}
    -- Only process new data since last run
    AND DATE(created_at) > (SELECT MAX(order_date) FROM {{ this }})
{% endif %}
GROUP BY 1, 2, 3
```

Stream Processing — How It Works
Stream processing handles data continuously — each event is processed within milliseconds of arrival. Think of an assembly line: each item is processed as it comes through, not queued up for a batch operation at the end of the day. The key insight is that the "dataset" is never-ending; it's an infinite sequence of events.
Event occurs
A user clicks, a payment is made, a sensor fires, an order is placed — any discrete event is captured. The event is a structured record: timestamp, event type, payload data.
Published to a message bus
The event is published to Apache Kafka, AWS Kinesis, or Google Pub/Sub. The message bus durably stores it and makes it available to consumers, even if they are temporarily offline.
Stream processor reads in real time
Apache Flink, Kafka Streams, or Spark Structured Streaming reads events from the message bus as they arrive. The processor maintains running state across events.
Applies logic, aggregations, windowing
The processor applies stateful logic: counting events per window, joining streams, detecting anomaly patterns, computing rolling averages. Time windows (tumbling, sliding, session) define how events are grouped.
Emits results immediately
Results are written to downstream systems in real time: a database for dashboards to query, an alerting system, another Kafka topic, or a cache like Redis.
Repeat for every event, continuously
The pipeline runs 24/7 without stopping. New events keep arriving; the processor keeps processing. There is no "job finished" — it is an ongoing, stateful computation.
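In miniature, that loop looks like this (pure Python; the generator stands in for a Kafka consumer, and the dict is the running state a real processor would checkpoint):

```python
from collections import defaultdict

def event_stream():
    # Stand-in for a Kafka consumer: a sequence of discrete events
    yield {"category": "books", "total": 12.0}
    yield {"category": "toys", "total": 30.0}
    yield {"category": "books", "total": 8.0}

# Running state survives across events — the hallmark of stream processing
running_revenue = defaultdict(float)
outputs = []

for event in event_stream():  # in production, this loop never ends
    running_revenue[event["category"]] += event["total"]
    # Emit an updated result immediately after every event
    outputs.append((event["category"], running_revenue[event["category"]]))

print(outputs)  # [('books', 12.0), ('toys', 30.0), ('books', 20.0)]
```

Note that a result is emitted per event, not once at the end — there is no "end" to wait for.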
Apache Kafka Streams
Java library for building stream applications directly on top of Kafka. Stateful operations, windowed aggregations, exactly-once semantics. No separate cluster needed — runs inside your application.
Apache Flink
The leading dedicated stream processor. True event-time processing, complex stateful logic, low latency at massive scale. Used at Uber, Netflix, and Alibaba for fraud detection and real-time analytics.
Spark Structured Streaming
Spark's streaming mode uses micro-batches (processes small windows of data every N seconds) rather than true event-by-event processing. Latency is 1-30 seconds — not milliseconds. Good for teams already using Spark.
AWS Kinesis / Google Pub/Sub
Managed streaming infrastructure services. Less setup and operational overhead than self-managed Kafka. Less control, higher per-event cost at scale. Good starting point for teams new to streaming.
```python
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.table import StreamTableEnvironment

env = StreamExecutionEnvironment.get_execution_environment()
env.set_parallelism(4)  # Process in parallel across 4 workers
t_env = StreamTableEnvironment.create(env)

# Read from Kafka topic in real time
t_env.execute_sql("""
    CREATE TABLE orders (
        order_id STRING,
        user_id STRING,
        total DOUBLE,
        category STRING,
        ts TIMESTAMP(3),
        WATERMARK FOR ts AS ts - INTERVAL '5' SECOND
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'order-events',
        'properties.bootstrap.servers' = 'kafka:9092',
        'properties.group.id' = 'flink-revenue-processor',
        'scan.startup.mode' = 'latest-offset',
        'format' = 'json'
    )
""")

# Real-time revenue: tumbling 1-minute windows
result = t_env.execute_sql("""
    SELECT
        TUMBLE_START(ts, INTERVAL '1' MINUTE) AS window_start,
        category,
        SUM(total) AS revenue,
        COUNT(*) AS order_count,
        AVG(total) AS avg_order_value
    FROM orders
    GROUP BY
        TUMBLE(ts, INTERVAL '1' MINUTE),
        category
""")
result.print()
```

```java
// Kafka Streams: real-time order counting per category
StreamsBuilder builder = new StreamsBuilder();
KStream<String, Order> orders = builder.stream("order-events");

KTable<Windowed<String>, Long> categoryCounts = orders
    .filter((key, order) -> "completed".equals(order.getStatus()))
    .groupBy((key, order) -> order.getCategory())
    .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(1)))
    .count(Materialized.as("category-counts-store"));

// Write real-time counts back to Kafka
categoryCounts.toStream()
    .map((windowedKey, count) -> KeyValue.pair(
        windowedKey.key(),
        new CategoryCount(windowedKey.key(), count,
            windowedKey.window().startTime())
    ))
    .to("category-counts-output");

KafkaStreams streams = new KafkaStreams(builder.build(), config);
streams.start();
```

When to Use Each Approach
| Factor | Use Batch When... | Use Streaming When... |
|---|---|---|
| Latency requirement | Hours-old data is acceptable for business decisions | Need real-time or near-real-time results (< 1 minute) |
| Use case examples | Nightly reports, ML model training, ETL pipelines, invoicing | Fraud detection, live dashboards, real-time alerts, recommendations |
| Team experience | Smaller team, simpler stack preferred, SQL expertise | Data engineering team comfortable with Kafka/Flink or Spark Streaming |
| Cost priority | Want to minimize infrastructure spend | Can justify higher infra cost for lower latency business value |
| Data volume | Very large historical data (TB+) processed periodically | High-velocity event streams (thousands to millions of events/sec) |
| Correctness | Need perfect accuracy over approximation | Can tolerate eventual consistency or approximation in real-time |
| Debugging needs | Need to easily rerun and debug failed jobs | Can handle debugging distributed stateful systems |
Lambda vs Kappa Architecture
The Lambda Architecture combines both: a batch layer for complete, accurate historical results and a speed layer for real-time (approximate) results. Results merge at query time. Modern teams increasingly use the Kappa Architecture (stream-only, with replay capability) to reduce the operational burden of maintaining two separate systems.
Real-World Use Cases
E-commerce — Both
Stream: real-time inventory updates when items are purchased, fraud alerts during checkout, instant order confirmation emails. Batch: nightly sales reports, product recommendation model training, revenue reconciliation, customer lifetime value calculations.
Finance — Mostly Stream
Fraud detection must happen in milliseconds during a transaction — a 5-second delay means the fraudulent charge goes through. End-of-day reconciliation and regulatory reporting use batch. Position calculation during trading hours uses stream.
Analytics — Mostly Batch
Weekly business reports, cohort analysis, A/B test result calculation, customer segmentation. Latency of hours is fine for these insights. Spark on a warehouse (BigQuery, Redshift, Snowflake) with dbt orchestration.
Notifications — Stream
"Your order shipped" must trigger within seconds of the shipping event being recorded. A batch job would delay the notification by hours. Push notifications, SMS, and webhook triggers all require streaming or near-real-time event processing.
Log Processing — Both
Real-time alerting on error rate spikes (stream: Flink or Kinesis). Long-term log analysis, usage reporting, cost attribution (batch: Spark on cold storage). Most companies run both in parallel for log data.
Machine Learning — Both
Model training: always batch (needs full dataset). Feature engineering for training: batch. Real-time inference: streaming features computed in real time and served from a feature store. Model monitoring: stream (detect drift immediately).
Stream Processing Concepts You Need to Know
Stream processing introduces concepts that do not exist in batch processing. Understanding these is essential before adopting a streaming architecture.
Event time vs processing time
Event time is when the event actually occurred. Processing time is when your system processes it. These differ due to network delays and late-arriving data. Flink's event-time processing handles this correctly; processing-time systems can produce incorrect results for out-of-order data.
Watermarks
A watermark tells the stream processor: "I will wait up to X seconds for late-arriving events before closing this time window." Events arriving after the watermark are dropped or sent to a side output. Critical for producing correct windowed aggregations.
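The rule can be sketched in a few lines of plain Python — illustrative only; Flink tracks watermarks per partition and per operator, not with a single counter like this:

```python
class WatermarkFilter:
    """Drop events that arrive after the watermark has passed them.

    Watermark = max event time seen so far, minus an allowed lateness.
    """
    def __init__(self, allowed_lateness):
        self.allowed_lateness = allowed_lateness
        self.max_event_time = float("-inf")

    def accept(self, event_time):
        # Advance the watermark as newer events arrive
        self.max_event_time = max(self.max_event_time, event_time)
        watermark = self.max_event_time - self.allowed_lateness
        return event_time >= watermark

wm = WatermarkFilter(allowed_lateness=5)
print(wm.accept(100))  # True  — watermark is now 95
print(wm.accept(97))   # True  — late, but within the 5s allowance
print(wm.accept(90))   # False — too late; dropped or sent to a side output
```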
Windowing
Groups events by time for aggregation. Tumbling windows: fixed non-overlapping periods (every 1 minute). Sliding windows: overlapping periods (5-minute window every 1 minute). Session windows: grouped by activity gaps (30-minute inactivity ends a session).
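Window assignment itself is simple arithmetic. A sketch of tumbling and sliding assignment (timestamps in seconds; the helper names are hypothetical, not any framework's API):

```python
def tumbling_window(ts, size):
    # Assign a timestamp to its fixed, non-overlapping window's start
    return ts - (ts % size)

def sliding_windows(ts, size, slide):
    # An event belongs to every overlapping window that covers it
    first = tumbling_window(ts, slide) - size + slide
    return [start for start in range(first, ts + 1, slide)
            if start + size > ts]

# 60s tumbling windows: events at t=7 and t=59 share the window starting at 0
assert tumbling_window(7, 60) == tumbling_window(59, 60) == 0

# 5-minute windows sliding every 1 minute: an event lands in 300/60 = 5 windows
print(len(sliding_windows(130, size=300, slide=60)))  # 5
```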
Exactly-once semantics
Guarantees each event is processed exactly once — not zero times (loss) or multiple times (duplication). Requires coordination between the message bus and the processor. Flink with Kafka supports exactly-once. Important for financial systems.
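A common application-level approximation is idempotent consumption: at-least-once delivery plus deduplication by event id. A minimal sketch — real exactly-once in Flink with Kafka uses checkpoints and transactional writes, not an in-memory set:

```python
class IdempotentConsumer:
    """Effectively-once processing: at-least-once delivery + dedup by id."""
    def __init__(self):
        self.seen_ids = set()  # in production: a persistent store
        self.balance = 0.0

    def process(self, event):
        if event["id"] in self.seen_ids:
            return False       # duplicate delivery: skip the side effect
        self.seen_ids.add(event["id"])
        self.balance += event["amount"]
        return True

c = IdempotentConsumer()
c.process({"id": "evt-1", "amount": 10.0})
c.process({"id": "evt-1", "amount": 10.0})  # redelivered after a retry
c.process({"id": "evt-2", "amount": 5.0})
print(c.balance)  # 15.0 — the duplicate did not double-count
```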
Backpressure
When a stream processor cannot keep up with the incoming event rate, it signals upstream systems to slow down. Proper backpressure handling prevents memory overflow and system crashes. Flink handles backpressure natively.
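A bounded buffer is the simplest form of backpressure. A sketch using Python's `queue.Queue`, where a full queue blocks the producer until the consumer catches up:

```python
import queue
import threading
import time

# With maxsize=2, put() blocks once 2 items are in flight — that blocking
# IS the backpressure signal propagating upstream.
buffer = queue.Queue(maxsize=2)
consumed = []

def slow_consumer():
    while True:
        item = buffer.get()
        if item is None:
            break
        time.sleep(0.01)  # consumer is slower than the producer
        consumed.append(item)

t = threading.Thread(target=slow_consumer)
t.start()

for i in range(5):
    buffer.put(i)  # blocks when the buffer is full, slowing the producer
buffer.put(None)   # sentinel: shut the consumer down
t.join()
print(consumed)  # [0, 1, 2, 3, 4] — nothing lost; the producer was slowed
```

Without the bound, a fast producer and slow consumer would grow the buffer without limit — the memory-overflow failure mode described above.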
State management
Streaming computations often need to remember past events (running totals, session data, joined records). State must be stored somewhere (RocksDB in Flink) and backed up for fault tolerance — this is the primary operational complexity of streaming systems.
Start with batch, add streaming only when needed