NHariGouthami

🚀 From 60 Minutes to 30 Minutes: Optimizing Spark JSON Processing & Delta Lake Backfill

Recently, I faced what looked like a “small” data engineering challenge:
We added a new column (ClientPlatform) to our data pipeline.

Sounds simple, right? But there was a catch — 30 days of historical data were missing the column. With terabytes of raw data generated per day in complex nested JSON, reprocessing everything would be painfully slow and costly. In fact, a straightforward backfill attempt took nearly 30 hours 😬.

Instead of brute force, I went down the optimization rabbit hole — and cut:

  • JSON reading time by 78x (6.83 min → 5.26 s)
  • Backfill time for 30 days from 30 hours → 15 hours (a 15-hour saving)

Here’s how I did it 👇

🎯 The Problem

  • Today’s data had ClientPlatform 
  • Last 30 days didn’t
  • Goal: backfill the column efficiently without blowing up compute or runtime.

A classic data engineering pain: historical fixes at scale.

1. JSON Reading — The Game Changer

The raw data was in compressed JSON (.json.gz). Two changes to the read path made the difference:

  • A pre-defined schema (instead of letting Spark infer it on every read)
  • Strict JSON parsing options

👉 Result: JSON read time dropped from 6.83 min → 5.26 s, a 78x improvement.
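
A minimal sketch of what that read looks like, assuming a Spark notebook session. The schema fields, input path, and parser options below are illustrative stand-ins, not the production values:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.getOrCreate()  # already exists in a notebook session

# Illustrative subset of the real schema; declaring it up front skips Spark's
# schema-inference pass over every compressed file.
schema = StructType([
    StructField("ConversationId", StringType()),
    StructField("RequestId", StringType()),
    StructField("ClientPlatform", StringType()),
    StructField("EventTime", TimestampType()),
])

df = (
    spark.read
    .schema(schema)                      # no inference scan
    .option("mode", "FAILFAST")          # assumed "strict" option: fail on malformed records
    .option("multiLine", "false")        # one JSON record per line
    .json("raw/events/*.json.gz")        # hypothetical input path
)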

2. The count() Trap

During development, I added:

df.count()

just to monitor progress. Sounds harmless?

It added 4+ minutes per run 🤦

  • count() forces a full dataset scan.
  • Spark launched 232 tasks for 24 cores (over-partitioned).
  • Stragglers dragged the job down.

Fix: Removed count() from intermediate steps → instant 78x boost.

💡 Lesson: Use file statistics or sampling for intermediate checks, and defer any full count to the final step.
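
Here is a hedged sketch of that pattern; raw_path, delta_path, and the filter are placeholders for the real pipeline steps:

# No counts between steps: read, transform, write.
df = spark.read.schema(schema).json(raw_path)          # schema as defined above; raw_path is hypothetical
df_clean = df.filter("ConversationId IS NOT NULL")     # stand-in for the real transformations

# Development-time sanity check that only touches a few partitions:
df_clean.show(5, truncate=False)

df_clean.write.format("delta").mode("append").save(delta_path)   # delta_path is hypothetical

# If a row count is needed at all, take it once, at the very end:
total_rows = spark.read.format("delta").load(delta_path).count()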

3. Delta Lake Merge Optimization

When we realized we needed to populate the new column (ClientPlatform) for the past 30 days, the naïve approach would have been:

  • Re-read all historical JSONs
  • Transform them fully
  • Overwrite existing Delta tables

That means reprocessing terabytes of data just to add one column. Costly in time, compute, and storage.

Enter Delta Merge

Instead of reprocessing everything, Delta Lake allows you to surgically update only the missing records.

Here’s the simplified syntax we used:

from delta.tables import DeltaTable

# Bind to the existing Delta table; target_table_path is illustrative
delta_table = DeltaTable.forPath(spark, target_table_path)

# df_platform_raw holds the re-read raw events that carry ClientPlatform
delta_table.alias("target").merge(
    df_platform_raw.alias("source"),
    "target.ConversationId = source.ConversationId "
    "AND target.RequestId = source.RequestId "
    f"AND target.slicedate = {slice_date} "
    f"AND target.PRegion = '{REGION}'"
).whenMatchedUpdate(
    # Only rows still missing the new column get touched
    condition="target.ClientPlatform IS NULL",
    set={"ClientPlatform": "source.ClientPlatform"}
).execute()

🚀 Why this is powerful

  • No full table rewrite → only the affected rows are touched.
  • Conditional updates → we only updated records where ClientPlatform was NULL.
  • ACID guarantees → even partial failures don’t corrupt data.
  • Massive time savings → our 30-day backfill dropped from 30 hours → 15 hours.
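
To cover the full window, the same merge simply runs once per historical day. The loop below is a hypothetical driver, not the production job; load_platform_events and backfill_day are illustrative wrappers around the read and merge shown above:

from datetime import date, timedelta

start = date.today() - timedelta(days=30)
for offset in range(30):
    slice_date = (start + timedelta(days=offset)).strftime("%Y%m%d")
    df_platform_raw = load_platform_events(slice_date)              # hypothetical: re-reads that day's raw JSON with the new column
    backfill_day(delta_table, df_platform_raw, slice_date, REGION)  # hypothetical: wraps the conditional merge above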

[Image: 🚀 Spark & Delta Lake Optimization: Before vs After]

🎯 Takeaway

Performance is rarely about one big trick. It’s about stacking small optimizations — configuration, schema design, partitioning, caching — that add up to dramatic results.

What looked like a “simple backfill” turned into a masterclass in Spark performance tuning. Hopefully, this walkthrough saves you some hours when you hit a similar challenge.

 

💬 I’d love to hear — what’s the biggest performance trap you have run into while working with Spark or Delta Lake?

Comments

Well explained! I've also seen big improvements from similar optimizations on partitioned tables; they saved a lot of time.