I am running a large-scale Spark job in Microsoft Fabric via Livy, where data is read from Delta tables and processed in iterative steps, with periodic checkpoints every few iterations to persist intermediate results. Despite using local checkpointing to reduce overhead, I am consistently hitting "LivyHttpRequestFailure: Submission failed due to error content =[RequestCancelled: upstreamService:livy, timeout:30s] HTTP status code: 504. Trace ID: 72baaa2d-9222-4bcf-aeef-566cb8903980" errors during checkpoint stages, indicating that the job execution time for certain steps exceeds Livy’s timeout threshold.
Worth noting that I ran the same notebook until very recently for a long time without any errors.
Can anyone help me with this?
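The iterative pattern described above can be sketched roughly as follows. This is a minimal sketch, not the actual job: `CHECKPOINT_INTERVAL`, `run_iterations`, and `step` are illustrative names, and the loop assumes a PySpark DataFrame with `localCheckpoint` available.

```python
# Hypothetical sketch of "iterative steps with periodic local checkpoints".
# Only should_checkpoint is pure logic; run_iterations assumes a live
# SparkSession and a PySpark DataFrame.

CHECKPOINT_INTERVAL = 5  # illustrative: persist every 5th iteration


def should_checkpoint(iteration: int, interval: int = CHECKPOINT_INTERVAL) -> bool:
    """Return True on every `interval`-th iteration (1-based)."""
    return iteration % interval == 0


def run_iterations(df, n_iterations, step):
    """Apply `step` repeatedly, truncating lineage with localCheckpoint
    every few iterations so the query plan does not grow unboundedly.

    Note: localCheckpoint stores blocks on executors (no reliable storage),
    which is faster than checkpoint() but is lost if an executor dies.
    """
    for i in range(1, n_iterations + 1):
        df = step(df)
        if should_checkpoint(i):
            df = df.localCheckpoint(eager=True)
    return df
```

The hypothetical `should_checkpoint` helper is separated out so the checkpoint cadence can be tuned (or tested) without touching the Spark loop.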
I tried everything, but nothing worked until I switched the account I was using to run the notebook. A colleague advised me to switch accounts because in the dev account I was using there were a lot of background operations and he thought it might work if I used another account specifically to run these kinds of notebooks. As soon as I changed the account, the notebook ran without any errors. But is this the only way? What if the notebook fails after a month? Do I have to switch accounts then too?
Thanks for sharing the update. Since the job succeeds when run under a different account, the issue seems tied to resource load or background operations in the original dev account rather than Livy itself. Switching accounts works as a workaround, but it’s not the only option.
A better long-term approach would be to:
- Review the Fabric capacity metrics to check whether background jobs are consuming resources.
- Coordinate with your admin to isolate heavy Spark workloads onto a dedicated capacity or workspace.
- Reduce background activity in the dev account to lower contention.
Hope this helps!
Thank You.
Hi @sreedharshan_10
The LivyHttpRequestFailure error you're encountering, particularly the RequestCancelled: upstreamService:livy, timeout:30s message, indicates that the Livy server timed out while waiting for a response from your Spark job. This can happen for several reasons, especially in large-scale jobs with iterative processing and checkpoints. Try the steps below:
- Increase the timeout beyond 30 seconds.
- Adjust checkpoint intervals to reduce overhead.
- Check for networking or firewall issues.
- Ensure sufficient resources are available for Spark.
- Use the Spark UI to identify long-running stages.
- Upgrade to the latest stable Livy/Spark versions.
- Test on a smaller dataset to isolate volume-related issues.
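For the resource step, in a Fabric notebook the session resources can be requested with a `%%configure` cell at the top of the notebook. The values below are illustrative placeholders, not tuned recommendations; adjust them to what your capacity allows:

```
%%configure -f
{
    "driverMemory": "28g",
    "driverCores": 4,
    "executorMemory": "28g",
    "executorCores": 4,
    "numExecutors": 4
}
```

The `-f` flag forces the session to restart so the new configuration takes effect.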
Hope this helps!
I wanted to follow up regarding the LivyHttpRequestFailure error you mentioned earlier, specifically the RequestCancelled: upstreamService:livy, timeout:30s issue.
In the previous response, I shared a set of recommendations, such as increasing the timeout beyond 30 seconds, adjusting checkpoint intervals to reduce overhead, checking for networking or firewall issues, ensuring sufficient resources for Spark, using the Spark UI to identify long-running stages, upgrading to the latest stable Livy/Spark versions, and testing on a smaller dataset to isolate volume-related issues.
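If you are driving Livy directly over its REST API, the first recommendation (raising the timeout) might look like this hedged sketch. The endpoint URL and payload values are illustrative; only the standard Livy `POST /batches` shape and the `requests`-style `(connect, read)` timeout tuple are assumed:

```python
# Hypothetical sketch: submit a Livy batch with a client-side read timeout
# well above 30 s. Field names follow the Livy POST /batches REST API.
import json


def build_batch_payload(file_path, executor_memory="28g", num_executors=4):
    """Build the JSON body for POST /batches (values are placeholders)."""
    return {
        "file": file_path,
        "executorMemory": executor_memory,
        "numExecutors": num_executors,
    }


def submit_batch(session, livy_url, payload, read_timeout=300):
    """POST the batch using a (connect, read) timeout tuple, so the HTTP
    client waits up to `read_timeout` seconds for Livy's response."""
    return session.post(
        f"{livy_url}/batches",
        data=json.dumps(payload),
        headers={"Content-Type": "application/json"},
        timeout=(10, read_timeout),
    )
```

Note that a 504 can still come from an intermediate gateway regardless of the client-side timeout, which is why the capacity and checkpoint-interval steps matter as well.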
Were you able to try any of these steps in your environment? If so, did they help reduce the timeout errors or improve job stability? If the problem persists, let us know and we can assist you further.