dbeavon3
Memorable Member

SessionExpiredException occurs regularly in driver's stderr (org.apache.zookeeper.ClientCnxn)

My Spark jobs have been failing regularly, and the following seems to be one of the things that predicts an imminent failure:

 

2025-10-12 23:29:27,220 WARN ClientCnxn [Thread-62-SendThread(vm-6a611403:2181)]: Session 0x1000000cfbb0000 for sever vm-6a611403/10.0.0.4:2181, Closing socket connection. Attempting reconnect except it is a SessionExpiredException.
org.apache.zookeeper.ClientCnxn$SessionTimeoutException: Client session timed out, have not heard from server in 37256ms for session id 0x1000000cfbb0000
	at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1258)

 

Whenever we see that SessionExpiredException from ClientCnxn in org.apache.zookeeper, our executors soon stop sending heartbeats and are then unceremoniously terminated.

... continued...

[screenshot: dbeavon3_0-1760842153903.png]

 

 

Our jobs don't respond well if the executors are terminated midway through a series of tasks.


Can someone please explain what ClientCnxn is (ZooKeeper) and why we keep getting this issue (SessionTimeoutException)?

 

 

9 REPLIES
BhaveshPatel
Community Champion

Hi @dbeavon3 

 

Hope you are doing well. You mention that your jobs don't respond well and that the driver/executors are failing. Could you please tell me how you are using the bronze, silver and gold layers? In Databricks this is relatively easy, and Microsoft adopted the same system (Delta Lake). In Microsoft Fabric there are no separate cluster types (job cluster, all-purpose cluster, etc.).

By the way, it's a Linux system (%fs ls commands).

Thanks & Regards,
Bhavesh

Love the Self Service BI.
Please use the 'Mark as answer' link to mark a post that answers your question. If you find a reply helpful, please remember to give Kudos.

Yes, we extract data with multiple layers.  

 

Can you please focus on the technical reason for these job failures?  The messages in the logs are meaningless without having a very low-level understanding of how Microsoft is hosting the so-called "ClientCnxn" (for zookeeper).  I have NEVER seen zookeeper errors from my spark jobs on other platforms.  And I have been working with spark for over three years now.

 

Hi @dbeavon3,

 

Thank you for reaching out to Microsoft Fabric Community.

 

Thank you @BhaveshPatel for the prompt response.

 

Thank you for sharing the detailed logs. This is a ZooKeeper client session expiry, which usually occurs when the Spark driver temporarily loses its connection to the ZooKeeper service. It is most often caused by a driver-side GC or CPU pause, or by a transient network interruption between the driver and the ZooKeeper service.

Please try re-running the job once with slightly higher timeouts (for example, increase the Spark network timeout). If it stabilizes, the issue was likely transient.

If it still fails with the same ZooKeeper error, it may be a platform-side ZooKeeper issue. This error is not related to bronze/silver/gold data layers; it is a coordination issue at the Spark runtime level.
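To make the mechanism concrete, here is a minimal sketch of how a ZooKeeper client session behaves, using the third-party kazoo Python client purely for illustration (the host and timeout values are hypothetical, and this is not how Fabric itself wires things up): when heartbeats are missed the connection goes SUSPENDED and the client tries to reconnect, and if the server does not hear from the client within the negotiated session timeout the session is expired and reported as LOST, which is the condition the driver's ClientCnxn warning corresponds to.

from kazoo.client import KazooClient, KazooState

# Hypothetical connection string and session timeout, for illustration only.
zk = KazooClient(hosts="127.0.0.1:2181", timeout=30.0)

def on_state_change(state):
    if state == KazooState.SUSPENDED:
        # Heartbeats were missed (e.g. a GC or CPU pause); the client retries.
        print("connection suspended, attempting to reconnect")
    elif state == KazooState.LOST:
        # The server expired the session: the analogue of SessionExpiredException.
        print("session expired (lost)")
    elif state == KazooState.CONNECTED:
        print("session (re)established")

zk.add_listener(on_state_change)
zk.start()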

 

In this case I recommend raising a Fabric support ticket with the timestamps so the backend team can check the zookeeper health. To raise a support ticket, kindly follow the steps outlined in the following guide:

How to create a Fabric and Power BI Support ticket - Power BI | Microsoft Learn

 

Thanks and regards,

Anjan Kumar Chippa

 

Hi @v-achippa 

Can you please give some more actionable instructions? (It would also help others who encounter the same bug in their clusters.)

 

This zookeeper problem never happened on the Spark clusters that run in Databricks or HDI.  I'm assuming there is some sort of proprietary hosting of the zookeeper components in Synapse, and it seems fragile.  (In HDInsight, the zookeeper components ran on separate VMs, but that does not seem to be the case in Fabric.)

 

Spark is a complex platform and we need more visibility to investigate the reason for these failures.  Whenever I reach out to MT or the PG, it is a multi-week effort, but there should be no reason for wasting more than two weeks.  I would like to be able to gather and investigate the logs myself.  I have an extremely high failure rate in Fabric when hosting Spark workloads here.  This is one of many errors.  Even if Microsoft can't fix their bugs, they should at least provide access to the necessary logs so we can distinguish between bugs and try to organize all the different varieties of failures into separate categories.

 

The Fabric flavor seems unusual in many ways.  On the databricks and HDI platforms I NEVER get the sense that my own diagnostic logs are being hidden from me.  I've always had the necessary access to troubleshoot these failures independently - both failures in my own code and in the cluster.
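For whatever driver stderr I can export from the monitoring UI, a quick triage pass like the following is how I'd start lining up the ZooKeeper warnings against the executor heartbeat messages by timestamp (just a sketch; the log file name is a placeholder for whatever gets downloaded):

import re

# Placeholder path: substitute the driver stderr file exported from the failed job.
LOG_PATH = "driver-stderr.log"

# Message patterns seen in this thread: ZooKeeper session trouble and the
# executor heartbeat problems that tend to follow it.
PATTERNS = {
    "zookeeper": re.compile(r"ClientCnxn|SessionExpiredException|SessionTimeoutException"),
    "heartbeat": re.compile(r"heartbeat", re.IGNORECASE),
}

with open(LOG_PATH, encoding="utf-8", errors="replace") as fh:
    for line in fh:
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                # Print the category plus the raw line so events can be ordered by timestamp.
                print(f"[{label}] {line.rstrip()}")
                break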

 

>> try re-running the job once with slightly higher timeouts like increase the spark timeout

 

Can you be more specific about which configurations to change?  Can you tell me if zookeeper is hosted on customer-specific capacity, or if we are potentially competing with other customers for these resources?

Hi @dbeavon3,

 

Thank you for the detailed response. Can you please try increasing the following Spark settings and re-run the job:

--conf spark.network.timeout=300s
--conf spark.executor.heartbeatInterval=30s

These govern the Spark network timeout and the executor heartbeat interval; increasing them helps the job tolerate the brief pauses or GC delays that can otherwise lead to session expiry.
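As a concrete illustration, here is a minimal sketch assuming you control how the Spark session is created (the app name is hypothetical; in a managed Fabric notebook these properties generally need to be supplied when the session starts, for example through the environment's Spark properties or a session-configuration cell, rather than changed mid-session):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("zk-timeout-tuning")  # hypothetical name, for illustration only
    # Allow longer gaps before network peers are declared unreachable.
    .config("spark.network.timeout", "300s")
    # Executors report liveness to the driver at this interval; keep it well
    # below spark.network.timeout.
    .config("spark.executor.heartbeatInterval", "30s")
    .getOrCreate()
)

If the driver-side GC hypothesis also needs checking, GC logging can be turned on the same way through spark.driver.extraJavaOptions, again at session start.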

 

Regarding ZooKeeper hosting in Fabric, the Spark coordination services (including ZooKeeper) are managed within the service, not on separate customer-visible VMs as in HDInsight. If your workspace runs on dedicated capacity, those resources are isolated to your tenant; on shared capacity, they are multi-tenant but still isolated at the process level.

 

If the failures continue even after adjusting the timeouts, I recommend raising a Fabric support ticket with the job ID and timestamps so the backend team can check ZooKeeper health and capacity side metrics.

 

Thanks and regards,

Anjan Kumar Chippa

>> "managed within the service".

 

Ideally we wouldn't be seeing errors about these things that we don't manage ourselves, especially when they don't have any customer-facing surface area and we cannot troubleshoot them.

 

I will try the configuration settings that you shared and see if there is a difference.

 

>> If your workspace runs on dedicated capacity, those resources are isolated to your tenant

 

Are they isolated to a workspace?  Is it possible that my workloads are conflicting with other workloads in the same tenant?  It is unfortunate that the zookeeper behavior is decreasing the reliability and health of the cluster.  I'm guessing it is underprovisioned in some way, or shared in ways that customers wouldn't expect.

 

 

 

Hi @dbeavon3,

 

Thank you for the response. In Fabric, the Spark coordination components (including ZooKeeper) are service-managed and containerized, not customer-managed.
They are isolated at the capacity level, not at the individual workspace level.

  • On a dedicated capacity, these services run within resources allocated only to that capacity and are not shared with other tenants.
  • Multiple workspaces under the same capacity will share that capacity’s resources, but the service manages scheduling and isolation internally to prevent one job from affecting another.
  • On a shared capacity, the coordination layer is multi-tenant and managed by Microsoft’s service fabric layer, but each job runs in its own isolated session.

You are right that these logs can appear even though the component is not customer-managed; they simply reflect transient coordination retries within the platform.

If the timeout changes reduce failures, that confirms a transient condition. If not, Fabric Support can review the backend ZooKeeper health for your job timestamps.

 

Thanks and regards,

Anjan Kumar Chippa

Hi @dbeavon3,

 

As we haven't heard back from you, we wanted to kindly follow up to check whether the solution provided for the issue worked, or let us know if you need any further assistance.

 

Thanks and regards,

Anjan Kumar Chippa

Hi @dbeavon3,

 

We wanted to kindly follow up to check whether the solution provided for the issue worked, or let us know if you need any further assistance.

 

Thanks and regards,

Anjan Kumar Chippa
