Batch Processing vs Stream Processing — Key Differences Explained

Should you process data in large batches every hour, or process each event the moment it arrives? This architectural decision shapes your system's latency, complexity, and cost. This guide explains both approaches with real-world use cases, tools, and a decision framework so you can confidently choose the right architecture for your data pipeline.

- Hours: typical batch latency
- Milliseconds: typical stream latency
- Lambda: architecture combining both
- 3 tools: Spark, Flink, Kafka Streams

1. The Core Difference

The fundamental question is: when does data get processed relative to when it arrives? This single distinction cascades into differences in latency, system design, tooling, cost, and operational complexity. Understanding it clearly is the foundation for all data architecture decisions.

| Aspect | Batch Processing | Stream Processing |
| --- | --- | --- |
| When data is processed | After accumulating a large dataset | As each event/record arrives |
| Latency | Minutes to hours | Milliseconds to seconds |
| Data model | Bounded dataset (finite) | Unbounded stream (continuous) |
| Typical schedule | Scheduled: nightly, hourly, weekly | Continuous: 24/7 |
| Error handling | Retry failed jobs, reprocess batch | Handle failures per event |
| Infrastructure cost | Lower — machines idle between runs | Higher — always running |
| Complexity | Lower — simpler mental model | Higher — stateful, time windows, watermarks |
| Debugging | Easy — rerun with same data | Harder — ephemeral data, replay needed |
| Best for | Reports, ETL, ML training | Fraud detection, live dashboards, alerts |
2. Batch Processing — How It Works

Batch processing collects data over a period, then processes the entire accumulated dataset at once. Think of it like washing dishes — you let them pile up all day, then wash them all in one go. The delay is acceptable because the output (clean dishes, or a daily report) is only needed once, not continuously.

1. Data accumulates in storage

Raw events, logs, transactions, or records are written to a data lake (S3, GCS, HDFS) or database throughout the day. Nothing is processed yet — data just collects.

2. Scheduled job triggers

A scheduler (cron, Apache Airflow, Prefect, dbt) triggers the batch job at the designated time — nightly at 2am, hourly at :00, or on an event like file arrival.

3. Batch engine reads the entire dataset

Apache Spark, Hadoop MapReduce, or a SQL engine reads all accumulated records from storage. The full dataset is available for any operation — sorting, joining, aggregating across all rows.

4. Transforms, aggregates, computes

The engine applies transformations: filtering invalid records, joining with reference data, computing aggregates (revenue by category), training ML models, or generating report datasets.

5. Writes results to destination

The processed output is written to a data warehouse (BigQuery, Redshift, Snowflake), a database, or back to a data lake in a partitioned format ready for querying.

6. Job completes, machines idle

Once the batch job finishes, the compute cluster shuts down or idles. On cloud infrastructure, this means you pay only for the time the job runs — not 24/7.

Apache Spark

The dominant batch engine. Processes TB to PB efficiently. Python (PySpark), Scala, Java, SQL APIs. Used by Netflix, Uber, Airbnb for ETL and ML. Runs on YARN, Kubernetes, or managed services (Databricks, EMR).

Apache Hadoop MapReduce

The original big data batch framework. Largely superseded by Spark for new projects (Spark is 10-100x faster in-memory) but still present in legacy enterprise systems.

AWS Glue / Google Dataflow

Managed ETL services. Write the transformation logic, cloud handles the infrastructure provisioning and scaling. Lower operational overhead than self-managed Spark.

SQL + Scheduled Jobs

For smaller scale: a nightly SQL job in PostgreSQL, Redshift, or BigQuery is perfectly valid batch processing. dbt (data build tool) organizes and schedules SQL transformations.

pyspark_batch_job.py (Python)
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, sum, count, date_trunc, round

spark = SparkSession.builder.appName("daily-revenue").getOrCreate()

# Read a day's worth of orders from S3 data lake
df = spark.read.parquet("s3://data-lake/orders/date=2026-03-25/")

# Step 1: Filter to completed orders only
completed = df.filter(col("status") == "completed")

# Step 2: Aggregate revenue by product category
revenue_by_category = (
    completed
    .groupBy("category", "region")
    .agg(
        sum("total").alias("revenue"),
        count("*").alias("order_count"),
        round(sum("total") / count("*"), 2).alias("avg_order_value")
    )
    .orderBy(col("revenue").desc())
)

# Step 3: Write to data warehouse (partitioned for efficient querying)
revenue_by_category.write \
    .mode("overwrite") \
    .partitionBy("region") \
    .parquet("s3://data-warehouse/daily-revenue/2026-03-25/")

print(f"Processed {completed.count():,} orders")
spark.stop()
dbt_daily_revenue.sql (SQL)
-- dbt model: daily_revenue.sql
-- Runs nightly after raw order data lands in the warehouse

{{ config(
    materialized='incremental',
    partition_by={'field': 'order_date', 'data_type': 'date'},
    cluster_by=['region', 'category']
) }}

SELECT
    DATE(created_at)        AS order_date,
    region,
    category,
    COUNT(*)                AS order_count,
    SUM(total)              AS revenue,
    AVG(total)              AS avg_order_value,
    COUNT(DISTINCT user_id) AS unique_customers
FROM {{ source('raw', 'orders') }}
WHERE status = 'completed'
{% if is_incremental() %}
    -- Only process new data since last run
    AND DATE(created_at) > (SELECT MAX(order_date) FROM {{ this }})
{% endif %}
GROUP BY 1, 2, 3
3. Stream Processing — How It Works

Stream processing handles data continuously — each event is processed within milliseconds of arrival. Think of an assembly line: each item is processed as it comes through, not queued up for a batch operation at the end of the day. The key insight is that the "dataset" is never-ending; it's an infinite sequence of events.

1. Event occurs

A user clicks, a payment is made, a sensor fires, an order is placed — any discrete event is captured. The event is a structured record: timestamp, event type, payload data.

2. Published to a message bus

The event is published to Apache Kafka, AWS Kinesis, or Google Pub/Sub. The message bus durably stores it and makes it available to consumers, even if they are temporarily offline.

3. Stream processor reads in real time

Apache Flink, Kafka Streams, or Spark Structured Streaming reads events from the message bus as they arrive. The processor maintains running state across events.

4. Applies logic, aggregations, windowing

The processor applies stateful logic: counting events per window, joining streams, detecting anomaly patterns, computing rolling averages. Time windows (tumbling, sliding, session) define how events are grouped.

5. Emits results immediately

Results are written to downstream systems in real time: a database for dashboards to query, an alerting system, another Kafka topic, or a cache like Redis.

6. Repeat for every event, continuously

The pipeline runs 24/7 without stopping. New events keep arriving; the processor keeps processing. There is no "job finished" — it is an ongoing, stateful computation.

Apache Kafka Streams

Java library for building stream applications directly on top of Kafka. Stateful operations, windowed aggregations, exactly-once semantics. No separate cluster needed — runs inside your application.

Apache Flink

A powerful, widely adopted dedicated stream processor. True event-time processing, complex stateful logic, low latency at massive scale. Used at Uber, Netflix, and Alibaba for fraud detection and real-time analytics.

Spark Structured Streaming

Spark's streaming mode uses micro-batches (processes small windows of data every N seconds) rather than true event-by-event processing. Latency is 1-30 seconds — not milliseconds. Good for teams already using Spark.

AWS Kinesis / Google Pub/Sub

Managed streaming infrastructure services. Less setup and operational overhead than self-managed Kafka. Less control, higher per-event cost at scale. Good starting point for teams new to streaming.

flink_stream_processing.py (Python)
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.table import StreamTableEnvironment

env = StreamExecutionEnvironment.get_execution_environment()
env.set_parallelism(4)  # Process in parallel across 4 workers
t_env = StreamTableEnvironment.create(env)

# Read from Kafka topic in real time
t_env.execute_sql("""
  CREATE TABLE orders (
    order_id STRING,
    user_id  STRING,
    total    DOUBLE,
    category STRING,
    ts       TIMESTAMP(3),
    WATERMARK FOR ts AS ts - INTERVAL '5' SECOND
  ) WITH (
    'connector' = 'kafka',
    'topic'     = 'order-events',
    'properties.bootstrap.servers' = 'kafka:9092',
    'properties.group.id' = 'flink-revenue-processor',
    'scan.startup.mode' = 'latest-offset',
    'format'    = 'json'
  )
""")

# Real-time revenue windows: tumbling 1-minute windows
result = t_env.execute_sql("""
  SELECT
    TUMBLE_START(ts, INTERVAL '1' MINUTE) AS window_start,
    category,
    SUM(total)  AS revenue,
    COUNT(*)    AS order_count,
    AVG(total)  AS avg_order_value
  FROM orders
  GROUP BY
    TUMBLE(ts, INTERVAL '1' MINUTE),
    category
""")

result.print()
KafkaStreamsApp.java (Java)
// Kafka Streams: real-time order counting per category
StreamsBuilder builder = new StreamsBuilder();

KStream<String, Order> orders = builder.stream("order-events");

KTable<Windowed<String>, Long> categoryCounts = orders
    .filter((key, order) -> "completed".equals(order.getStatus()))
    .groupBy((key, order) -> order.getCategory())
    .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(1)))
    .count(Materialized.as("category-counts-store"));

// Write real-time counts back to Kafka
categoryCounts.toStream()
    .map((windowedKey, count) -> KeyValue.pair(
        windowedKey.key(),
        new CategoryCount(windowedKey.key(), count,
            windowedKey.window().startTime())
    ))
    .to("category-counts-output");

KafkaStreams streams = new KafkaStreams(builder.build(), config);
streams.start();
4. When to Use Each Approach

| Factor | Use Batch When... | Use Streaming When... |
| --- | --- | --- |
| Latency requirement | Hours-old data is acceptable for business decisions | Need real-time or near-real-time results (< 1 minute) |
| Use case examples | Nightly reports, ML model training, ETL pipelines, invoicing | Fraud detection, live dashboards, real-time alerts, recommendations |
| Team experience | Smaller team, simpler stack preferred, SQL expertise | Data engineering team comfortable with Kafka/Flink or Spark Streaming |
| Cost priority | Want to minimize infrastructure spend | Can justify higher infra cost for lower latency business value |
| Data volume | Very large historical data (TB+) processed periodically | High-velocity event streams (thousands to millions of events/sec) |
| Correctness | Need perfect accuracy over approximation | Can tolerate eventual consistency or approximation in real time |
| Debugging needs | Need to easily rerun and debug failed jobs | Can handle debugging distributed stateful systems |

Lambda vs Kappa Architecture

The Lambda Architecture combines both: a batch layer for complete, accurate historical results and a speed layer for real-time (approximate) results. Results merge at query time. Modern teams increasingly use the Kappa Architecture (stream-only, with replay capability) to reduce the operational burden of maintaining two separate systems.
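The query-time merge at the heart of the Lambda Architecture can be sketched in a few lines of Python. This is an illustration only — `batch_view`, `speed_view`, and `query_revenue` are hypothetical names, and in a real system the two views would live in a warehouse table and a low-latency store rather than in-memory dicts:

```python
# Lambda architecture sketch: merge the batch layer (complete, accurate,
# hours old) with the speed layer (approximate deltas since the last
# batch run) at query time. All names here are illustrative.

def query_revenue(category: str, batch_view: dict, speed_view: dict) -> float:
    """Serve a query by combining the batch result with real-time increments."""
    historical = batch_view.get(category, 0.0)  # from the nightly batch job
    recent = speed_view.get(category, 0.0)      # from the streaming layer
    return historical + recent

batch_view = {"electronics": 120_000.0, "books": 4_500.0}
speed_view = {"electronics": 350.0}  # orders since the last nightly run

print(query_revenue("electronics", batch_view, speed_view))  # 120350.0
print(query_revenue("books", batch_view, speed_view))        # 4500.0
```

The Kappa Architecture removes the merge entirely: the stream is the single source of truth, and "batch" recomputation becomes a replay of the stream from the beginning.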

5. Real-World Use Cases

E-commerce — Both

Stream: real-time inventory updates when items are purchased, fraud alerts during checkout, instant order confirmation emails. Batch: nightly sales reports, product recommendation model training, revenue reconciliation, customer lifetime value calculations.

Finance — Mostly Stream

Fraud detection must happen in milliseconds during a transaction — a 5-second delay means the fraudulent charge goes through. End-of-day reconciliation and regulatory reporting use batch. Position calculation during trading hours runs as a stream.

Analytics — Mostly Batch

Weekly business reports, cohort analysis, A/B test result calculation, customer segmentation. Latency of hours is fine for these insights. Spark on a warehouse (BigQuery, Redshift, Snowflake) with dbt orchestration.

Notifications — Stream

"Your order shipped" must trigger within seconds of the shipping event being recorded. A batch job would delay the notification by hours. Push notifications, SMS, and webhook triggers all require streaming or near-real-time event processing.

Log Processing — Both

Real-time alerting on error rate spikes (stream: Flink or Kinesis). Long-term log analysis, usage reporting, cost attribution (batch: Spark on cold storage). Most companies run both in parallel for log data.

Machine Learning — Both

Model training: always batch (needs full dataset). Feature engineering for training: batch. Real-time inference: streaming features computed in real time and served from a feature store. Model monitoring: stream (detect drift immediately).

6. Stream Processing Concepts You Need to Know

Stream processing introduces concepts that do not exist in batch processing. Understanding these is essential before adopting a streaming architecture.

Event time vs processing time

Event time is when the event actually occurred. Processing time is when your system processes it. These differ due to network delays and late-arriving data. Flink's event-time processing handles this correctly; processing-time systems can produce incorrect results for out-of-order data.
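The difference is easiest to see with a concrete example. The following pure-Python sketch (timestamps are made up for illustration) counts the same three events per minute both ways — the delayed event lands in a different window depending on which clock you use:

```python
# Event time vs processing time: the same three events, counted two ways.
# The 10:00:59 click was delayed in transit and arrived out of order.

events = [
    {"event_time": "10:00:10", "arrived_at": "10:00:11"},
    {"event_time": "10:01:05", "arrived_at": "10:01:06"},
    {"event_time": "10:00:59", "arrived_at": "10:01:08"},  # late arrival
]

def count_by_minute(events, key):
    """Count events per minute bucket using the given timestamp field."""
    counts = {}
    for e in events:
        minute = e[key][:5]  # "HH:MM"
        counts[minute] = counts.get(minute, 0) + 1
    return counts

# Event-time counting assigns the late event to the minute it occurred in.
print(count_by_minute(events, "event_time"))   # {'10:00': 2, '10:01': 1}
# Processing-time counting assigns it to the minute it arrived in.
print(count_by_minute(events, "arrived_at"))   # {'10:00': 1, '10:01': 2}
```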

Watermarks

A watermark tells the stream processor: "I will wait up to X seconds for late-arriving events before closing this time window." Events arriving after the watermark are dropped or sent to a side output. Critical for producing correct windowed aggregations.
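The mechanics can be sketched in plain Python — this is not the Flink API, just a minimal model of the idea, with integer-second timestamps, a 60-second tumbling window, and 5 seconds of allowed lateness:

```python
# Watermark sketch: watermark = max event time seen minus allowed lateness.
# A window closes once the watermark passes its end; later events for that
# window are dropped. Illustrative only, not a real framework API.

ALLOWED_LATENESS = 5
WINDOW = 60

max_event_time = 0
open_windows = {}   # window_start -> event count
closed = {}         # finalized windows

def on_event(ts: int) -> str:
    """Process one event with integer-second timestamp ts."""
    global max_event_time
    window_start = ts - ts % WINDOW
    if window_start + WINDOW <= max_event_time - ALLOWED_LATENESS:
        return "dropped"          # window already closed: event is too late
    open_windows[window_start] = open_windows.get(window_start, 0) + 1
    max_event_time = max(max_event_time, ts)
    watermark = max_event_time - ALLOWED_LATENESS
    for ws in list(open_windows):  # close windows now behind the watermark
        if ws + WINDOW <= watermark:
            closed[ws] = open_windows.pop(ws)
    return "accepted"

on_event(10)
on_event(55)          # both land in window [0, 60)
on_event(70)          # watermark advances to 65, closing window [0, 60)
print(closed)         # {0: 2}
print(on_event(30))   # dropped: its window closed before it arrived
```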

Windowing

Groups events by time for aggregation. Tumbling windows: fixed non-overlapping periods (every 1 minute). Sliding windows: overlapping periods (5-minute window every 1 minute). Session windows: grouped by activity gaps (30-minute inactivity ends a session).
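The three window types boil down to three different assignment rules. A minimal sketch with integer-second timestamps (window sizes are illustrative, not defaults from any framework):

```python
# Window assignment for the three window types described above.

def tumbling(ts: int, size: int = 60) -> tuple:
    """Each event falls in exactly one fixed, non-overlapping window."""
    start = ts - ts % size
    return (start, start + size)

def sliding(ts: int, size: int = 300, slide: int = 60) -> list:
    """Each event falls in every overlapping window that covers it."""
    first = (ts - ts % slide) - size + slide
    return [(s, s + size) for s in range(first, ts + 1, slide)
            if s <= ts < s + size]

def sessionize(timestamps: list, gap: int = 1800) -> list:
    """Split sorted timestamps into sessions separated by > gap of inactivity."""
    sessions, current = [], [timestamps[0]]
    for ts in timestamps[1:]:
        if ts - current[-1] > gap:
            sessions.append(current)
            current = []
        current.append(ts)
    sessions.append(current)
    return sessions

print(tumbling(125))               # (120, 180)
print(len(sliding(125)))           # 5 overlapping 5-minute windows cover t=125
print(sessionize([0, 100, 5000]))  # [[0, 100], [5000]]
```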

Exactly-once semantics

Guarantees each event is processed exactly once — not zero times (loss) or multiple times (duplication). Requires coordination between the message bus and the processor. Flink with Kafka supports exactly-once. Important for financial systems.

Backpressure

When a stream processor cannot keep up with the incoming event rate, it signals upstream systems to slow down. Proper backpressure handling prevents memory overflow and system crashes. Flink handles backpressure natively.
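The simplest form of backpressure is a bounded buffer between producer and consumer: when the buffer fills, the producer blocks instead of the process running out of memory. A small sketch using Python's standard library (real stream processors propagate this signal across the network, but the principle is the same):

```python
# Backpressure sketch: a bounded queue between a fast producer and a
# slow consumer. queue.put() blocks when the buffer is full, slowing
# ingestion rather than exhausting memory.

import queue
import threading
import time

buffer = queue.Queue(maxsize=10)   # bounded: this is the backpressure point
processed = []

def consumer():
    while True:
        item = buffer.get()
        if item is None:           # sentinel value: shut down
            return
        time.sleep(0.001)          # simulate slow per-event processing
        processed.append(item)

t = threading.Thread(target=consumer)
t.start()

for i in range(50):
    buffer.put(i)                  # blocks whenever the buffer is full
buffer.put(None)
t.join()

print(len(processed))              # 50 — nothing was lost or buffered unboundedly
```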

State management

Streaming computations often need to remember past events (running totals, session data, joined records). State must be stored somewhere (RocksDB in Flink) and backed up for fault tolerance — this is the primary operational complexity of streaming systems.
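The idea of durable, checkpointed state can be sketched with a running total persisted to a file — a stand-in for what Flink does with RocksDB plus checkpoints; the file name and per-event checkpoint frequency here are purely illustrative:

```python
# State + checkpointing sketch: a running total that survives restarts.
# A JSON file stands in for a durable state backend like RocksDB.

import json
import os
import tempfile

CHECKPOINT = os.path.join(tempfile.gettempdir(), "revenue_state.json")
if os.path.exists(CHECKPOINT):
    os.remove(CHECKPOINT)          # start fresh for this demo

def load_state() -> dict:
    """Recover state from the last checkpoint, or start empty."""
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            return json.load(f)
    return {"total": 0.0, "events_seen": 0}

def checkpoint(state: dict) -> None:
    with open(CHECKPOINT, "w") as f:
        json.dump(state, f)

state = load_state()
for amount in [10.0, 25.0, 5.0]:
    state["total"] += amount
    state["events_seen"] += 1
    checkpoint(state)              # real systems checkpoint periodically

# A restarted processor resumes from the checkpoint instead of from zero.
recovered = load_state()
print(recovered["total"])          # 40.0
```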

Start with batch, add streaming only when needed

Many teams add streaming prematurely because it sounds sophisticated. Start with batch processing. If your business requirements genuinely require sub-minute latency and batch cannot deliver it, then invest in a streaming architecture. The operational complexity of streaming is significant — make sure the business value justifies it.

Frequently Asked Questions