ps19234561
New Member

Spark Silent Failure

Long time listener, first time caller here.

 

We have our core deduplication process configured in Fabric. It had been running exceptionally well for months. The last successful run was on November 13; we didn't run it again until the week of November 20. It is now silently failing with little or no error output to go on - it appears something has changed. We have tried everything on our side, but nothing we do can produce a successful run.

We even tried migrating back to Synapse, and it silently fails in the same place. Our input is only 15,000 rows, and for testing I reduced it to 500 - it made no difference in the behavior.

 

We are using the Zingg OSS API (0.5.0) with Fabric runtime 1.3 (Spark 3.5.5) for our core entity resolution. All phases of Zingg such as findTrainingData, label, and train work well and write to our lakehouse. However, the match phase takes five times longer to run and silently fails - it does not write any output to our lakehouse. The job just appears to stall out, the logs stop mid-write, and the Spark job reports "success". No errors are present anywhere in the logs. While the job is running I cannot reach the Spark monitor; it shows a busy message.
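
For reference, this is roughly the post-run check I use to confirm that the "successful" match run really produced nothing - a minimal sketch only, and the table name and Files path below are placeholders, not Zingg's actual output names:

# Minimal post-run check in a Fabric notebook, where `spark` is the session the notebook provides.
from pyspark.sql.utils import AnalysisException

match_table = "zingg_match_output"        # placeholder output table name
match_path = "Files/zingg/match_output"   # placeholder Files path in the lakehouse

if spark.catalog.tableExists(match_table):
    print(f"{match_table} exists with {spark.table(match_table).count()} rows")
else:
    print(f"{match_table} was not created")

try:
    print(f"{match_path} holds {spark.read.parquet(match_path).count()} rows")
except AnalysisException:
    print(f"nothing readable at {match_path}")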

 

We did open a support case with Microsoft - 2511280040006082 - and they indicate that the VHD appears to have changed; we are waiting for follow-up on next steps.

 

But at the same time, I am curious if anyone else has experienced something similar, and has any configuration advice that we can try on our side?

 

 

 

1 ACCEPTED SOLUTION

Thank you very much for the response. I tried running again and it "failed". Then, as per your recommendation, I configured this:

 

#Based on comments from the fabric community - try these until Microsoft resolves this issue
spark.conf.set("spark.sql.shuffle.partitions", "10")
spark.conf.set("spark.sql.adaptive.enabled", "false")
#spark.conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") - not exposed in Fabric
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")
 
This works for us and restores our deduplication process for the time being, until Microsoft fixes the regression.

View solution in original post

4 REPLIES
v-tsaipranay
Community Support

Hi @ps19234561 ,

 

Thank you for confirming the results. It’s great to hear that the configuration adjustments helped restore your deduplication process while the underlying regression is being addressed. If you run into any additional issues or have further questions as you continue testing, please feel free to reach out again in the Fabric Community - we're here to support you.

 

Thank you.

Vinodh247
Solution Sage

This is almost certainly a Fabric runtime regression, not your Zingg configuration. When a workload runs perfectly for months and then starts “succeeding” with no output, no errors, and blocked Spark UI access, it points to a change in the underlying Spark VHD or Fabric worker image. Your support ticket note that the “VHD appears to have changed” confirms that something in the recent November rollout likely broke long-running or shuffle-heavy jobs.

 

Key signals that this is platform-side, not Zingg-side:

  1. All earlier Zingg phases run fine - only match fails. Match is the most shuffle-intensive stage and the first to expose regressions in the cluster image or shuffle service.

  2. Silent success with partial logs is typical of a Fabric worker crash or executor reset. Fabric often marks these as “successful” if the driver exits cleanly.

  3. The Spark monitor being unavailable is a known symptom when the driver is stuck in a dead state or the worker hosting the UI has crashed.

  4. The same failure happens on the Synapse runtime as well. That strongly suggests both runtimes now share the updated VHD image.

What you can try locally while you wait for Microsoft engineering (a combined sketch follows after this list):

  1. Explicitly pin shuffle partitions
    spark.sql.shuffle.partitions = 50 (or even lower for 15k rows)

  2. Disable adaptive execution
    spark.sql.adaptive.enabled = false

  3. Force Kryo serialization
    spark.serializer = org.apache.spark.serializer.KryoSerializer

  4. Disable automatic broadcast joins
    spark.sql.autoBroadcastJoinThreshold = -1

  5. Run match with a very small sample to confirm the job actually enters Zingg’s matching phase. If even 500 rows fail silently, the problem is platform-level.

  6. Try running in a completely new workspace and new Lakehouse (clean compute root). This occasionally avoids corrupted mount artifacts.

Realistically speaking, none of these are likely to fix a true VHD regression, but they will help you rule out config issues and give you reproduction data for the support case.
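
As a combined, untested sketch of steps 1 through 5 in one notebook cell (the customers_clean table name is just a placeholder for your cleaned input):

# Apply steps 1, 2 and 4 in the current session, then stage a small sample for step 5.
spark.conf.set("spark.sql.shuffle.partitions", "50")          # step 1: pin shuffle partitions
spark.conf.set("spark.sql.adaptive.enabled", "false")         # step 2: disable adaptive execution
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")  # step 4: disable broadcast joins
# step 3 (spark.serializer) is a static setting that must be applied at session start,
# so spark.conf.set cannot change it mid-session.

# step 5: stage a 500-row sample and point Zingg's match phase at it
sample = spark.table("customers_clean").limit(500)            # placeholder source table
sample.write.mode("overwrite").saveAsTable("customers_match_sample")
# then run the match phase against customers_match_sample instead of the full table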

 


Please 'Kudos' and 'Accept as Solution' if this answered your query.

Regards,
Vinodh
Microsoft MVP [Fabric]

Thank you very much for the response. I tried running again and it "failed". Then, as per your recommendation, I configured this:

 

#Based on comments from the fabric community - try these until Microsoft resolves this issue
spark.conf.set("spark.sql.shuffle.partitions", "10")
spark.conf.set("spark.sql.adaptive.enabled", "false")
#spark.conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") - not exposed in Fabric
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")
 
This works for us and restores our deduplication process for the time being, until Microsoft fixes the regression.
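
If it is easier to set these once at session start instead of mid-notebook, a variant like the following might also work - just a sketch we have not verified, assuming Fabric's %%configure magic accepts a conf map the way Synapse's does:

%%configure -f
{
    "conf": {
        "spark.sql.shuffle.partitions": "10",
        "spark.sql.adaptive.enabled": "false",
        "spark.sql.autoBroadcastJoinThreshold": "-1"
    }
}
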
Ugk161610
Continued Contributor

@ps19234561 ,

 

We actually ran into a very similar issue recently, so your post sounds familiar. Our Spark jobs had been working fine for months, then suddenly one phase started “succeeding” with no output and logs stopping halfway through. The Spark UI also showed “busy” and wouldn’t load while the job was running.

 

In our case, nothing in the code or data changed — the only thing that lined up was a recent Fabric runtime/VHD update. After that update, one part of the workflow would hang quietly instead of throwing an error.

 

A few things that helped us narrow it down:

 

  • Running the same notebook in a new workspace worked, which told us it wasn’t the logic.

  • Clearing the session and re-attaching the lakehouse made the job run further, but not consistently.

  • Re-adding our dependency jars forced the environment to refresh and helped temporarily.

Not perfect fixes, but enough to confirm it wasn’t data size or Zingg config — it was the environment.

 

Since Microsoft already told you the VHD changed, your support case is definitely the right path. If they share a workaround or rollback option, please update here — this might be affecting more people.

 

GopiKrishna

 
