I am running a large-scale Spark job in Microsoft Fabric via Livy, where data is read from Delta tables and processed in iterative steps, with periodic checkpoints every few iterations to persist intermediate results. Despite using local checkpointing to reduce overhead, I am consistently hitting "LivyHttpRequestFailure: Submission failed due to error content =[RequestCancelled: upstreamService:livy, timeout:30s] HTTP status code: 504. Trace ID: 72baaa2d-9222-4bcf-aeef-566cb8903980" errors during checkpoint stages, indicating that the job execution time for certain steps exceeds Livy’s timeout threshold.
Worth noting that I ran the same notebook until very recently for a long time without any errors.
Can anyone help me with this?
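The iterative pattern described above can be sketched roughly as follows. This is a minimal sketch, not the actual job: `CHECKPOINT_INTERVAL`, `run_iterations`, and `step` are illustrative names, and the loop assumes a PySpark DataFrame with `localCheckpoint` available.

```python
# Hypothetical sketch of "iterative steps with periodic local checkpoints".
# Only should_checkpoint is pure logic; run_iterations assumes a live
# SparkSession and a PySpark DataFrame.

CHECKPOINT_INTERVAL = 5  # illustrative: persist every 5th iteration


def should_checkpoint(iteration: int, interval: int = CHECKPOINT_INTERVAL) -> bool:
    """Return True on every `interval`-th iteration (1-based)."""
    return iteration % interval == 0


def run_iterations(df, n_iterations, step):
    """Apply `step` repeatedly, truncating lineage with localCheckpoint
    every few iterations so the query plan does not grow unboundedly.

    Note: localCheckpoint stores blocks on executors (no reliable storage),
    which is faster than checkpoint() but is lost if an executor dies.
    """
    for i in range(1, n_iterations + 1):
        df = step(df)
        if should_checkpoint(i):
            df = df.localCheckpoint(eager=True)
    return df
```

The hypothetical `should_checkpoint` helper is separated out so the checkpoint cadence can be tuned (or tested) without touching the Spark loop.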
I tried everything, but nothing worked until I switched the account I was using to run the notebook. A colleague advised me to switch accounts because in the dev account I was using there were a lot of background operations and he thought it might work if I used another account specifically to run these kinds of notebooks. As soon as I changed the account, the notebook ran without any errors. But is this the only way? What if the notebook fails after a month? Do I have to switch accounts then too?
Thanks for sharing the update. Since the job succeeds when run under a different account, the issue seems tied to resource load or background operations in the original dev account rather than Livy itself. Switching accounts works as a workaround, but it’s not the only option.
A better long-term approach would be to:
- Review the Fabric capacity metrics to check whether background jobs are consuming resources.
- Coordinate with your admin to isolate heavy Spark workloads onto a dedicated capacity or workspace.
- Reduce background activity in the dev account to lower contention.
Hope this helps!
Thank You.
Hi @sreedharshan_10
The LivyHttpRequestFailure error you're encountering, particularly the RequestCancelled: upstreamService:livy, timeout:30s message, indicates that the Livy server timed out while waiting for a response from your Spark job. This can happen for several reasons, especially in large-scale jobs with iterative processing and checkpoints. Try the steps below:
- Increase the timeout beyond 30 seconds.
- Adjust checkpoint intervals to reduce overhead.
- Check for networking or firewall issues.
- Ensure sufficient resources are available for Spark.
- Use the Spark UI to identify long-running stages.
- Upgrade to the latest stable Livy/Spark versions.
- Test on a smaller dataset to isolate volume-related issues.
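For the resource step, in a Fabric notebook the session resources can be requested with a `%%configure` cell at the top of the notebook. The values below are illustrative placeholders, not tuned recommendations; adjust them to what your capacity allows:

```
%%configure -f
{
    "driverMemory": "28g",
    "driverCores": 4,
    "executorMemory": "28g",
    "executorCores": 4,
    "numExecutors": 4
}
```

The `-f` flag forces the session to restart so the new configuration takes effect.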
Hope this helps!
I wanted to follow up regarding the LivyHttpRequestFailure error you mentioned earlier, specifically the RequestCancelled: upstreamService:livy, timeout:30s issue.
In the previous response, I shared a set of recommendations, such as increasing the timeout beyond 30 seconds, adjusting checkpoint intervals to reduce overhead, checking for networking or firewall issues, ensuring sufficient resources for Spark, using the Spark UI to identify long-running stages, upgrading to the latest stable Livy/Spark versions, and testing on a smaller dataset to isolate volume-related issues.
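If you are driving Livy directly over its REST API, the first recommendation (raising the timeout) might look like this hedged sketch. The endpoint URL and payload values are illustrative; only the standard Livy `POST /batches` shape and the `requests`-style `(connect, read)` timeout tuple are assumed:

```python
# Hypothetical sketch: submit a Livy batch with a client-side read timeout
# well above 30 s. Field names follow the Livy POST /batches REST API.
import json


def build_batch_payload(file_path, executor_memory="28g", num_executors=4):
    """Build the JSON body for POST /batches (values are placeholders)."""
    return {
        "file": file_path,
        "executorMemory": executor_memory,
        "numExecutors": num_executors,
    }


def submit_batch(session, livy_url, payload, read_timeout=300):
    """POST the batch using a (connect, read) timeout tuple, so the HTTP
    client waits up to `read_timeout` seconds for Livy's response."""
    return session.post(
        f"{livy_url}/batches",
        data=json.dumps(payload),
        headers={"Content-Type": "application/json"},
        timeout=(10, read_timeout),
    )
```

Note that a 504 can still come from an intermediate gateway regardless of the client-side timeout, which is why the capacity and checkpoint-interval steps matter as well.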
Were you able to try any of these steps in your environment? If so, did they help reduce the timeout errors or improve job stability? If the problem persists, let us know and we can assist you further.