sreedharshan_10
Frequent Visitor

Livy HTTP Request Error

I am running a large-scale Spark job in Microsoft Fabric via Livy. The job reads data from Delta tables and processes it in iterative steps, with periodic checkpoints every few iterations to persist intermediate results. Despite using local checkpointing to reduce overhead, I consistently hit this error during checkpoint stages: "LivyHttpRequestFailure: Submission failed due to error content =[RequestCancelled: upstreamService:livy, timeout:30s] HTTP status code: 504. Trace ID: 72baaa2d-9222-4bcf-aeef-566cb8903980". This suggests that the execution time of certain steps exceeds Livy's timeout threshold.

Worth noting that I ran the same notebook until very recently for a long time without any errors.

Can anyone help me with this?
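For context, the checkpoint cadence described above can be sketched as simple pacing logic. This is a Spark-free illustration under stated assumptions: in the real job, each "checkpoint" step would be something like `df = df.localCheckpoint()`, and the function name here is purely illustrative.

```python
def checkpoint_iterations(total_iters: int, interval: int) -> list[int]:
    """Return the (1-based) iterations at which a checkpoint fires,
    checkpointing every `interval` iterations."""
    if interval < 1:
        raise ValueError("interval must be >= 1")
    return [i for i in range(1, total_iters + 1) if i % interval == 0]
```

Widening the interval means fewer long-running checkpoint stages (each of which risks hitting the 30 s timeout), at the cost of a longer lineage to recompute if a step fails.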

5 REPLIES
sreedharshan_10
Frequent Visitor

I tried everything, but nothing worked until I switched the account I was using to run the notebook. A colleague advised the switch because the dev account I was using had a lot of background operations running, and he thought a separate account dedicated to running these notebooks might work. As soon as I changed accounts, the notebook ran without any errors. But is this the only fix? What if the notebook fails again in a month? Do I have to switch accounts then too?

Hi @sreedharshan_10 

Thanks for sharing the update. Since the job succeeds when run under a different account, the issue seems tied to resource load or background operations in the original dev account rather than Livy itself. Switching accounts works as a workaround, but it’s not the only option.


A better long-term approach would be to:

  • Review Fabric capacity metrics to check whether background jobs are consuming resources.

  • Coordinate with your admin to isolate heavy Spark workloads onto a dedicated capacity or workspace.

  • Reduce background activity in the dev account to lower contention.

Hope it helps!

Thank you.


v-aatheeque
Community Support

Hi @sreedharshan_10

The LivyHttpRequestFailure error you're encountering, particularly the RequestCancelled: upstreamService:livy, timeout:30s message, indicates that the Livy server is timing out while waiting for a response from your Spark job. This can happen for several reasons, especially in large-scale jobs with iterative processing and checkpoints. Try the steps below:


  • If you can change the settings, raise the timeout from the default 30 seconds to something higher (e.g., 120 seconds or more) to match the expected duration of your longest Spark stages.
  • Since you’re doing periodic checkpoints, make sure they’re not happening so often that they slow everything down. Increase the interval between checkpoints if needed.
  • A 504 error can also mean a networking problem. Ensure there are no connectivity or firewall issues causing requests to drop or time out.
  • Give your job enough resources (CPU, memory, etc.) so the longer-running steps can finish within the timeout window.
  • Use the Spark UI to see which stages are taking unusually long. That can help you pinpoint where the slowdown happens.
  • If you’re on an older Livy or Spark version, upgrade to the latest stable release to benefit from bug fixes and performance improvements.
  • Run the same job on a reduced dataset. If it succeeds, the issue might be related to the current data volume or complexity.

Hope this helps!


Hi @sreedharshan_10 

I wanted to follow up regarding the LivyHttpRequestFailure error you mentioned earlier, specifically the RequestCancelled: upstreamService:livy, timeout:30s issue.


In the previous response, I shared a set of recommendations, such as increasing the timeout beyond 30 seconds, adjusting checkpoint intervals to reduce overhead, checking for networking or firewall issues, ensuring sufficient resources for Spark, using the Spark UI to identify long-running stages, upgrading to the latest stable Livy/Spark versions, and testing on a smaller dataset to isolate volume-related issues.


Were you able to try any of these steps in your environment? If so, did they help reduce the timeout errors or improve job stability? If the problem persists, let us know so we can assist further.

Hi @sreedharshan_10 

I’m following up on the LivyHttpRequestFailure (RequestCancelled: upstreamService:livy, timeout:30s) issue you raised. In my earlier response, I suggested steps such as extending the timeout, tuning checkpoint intervals, verifying networking/firewall settings, checking Spark resource availability, reviewing long-running stages in the Spark UI, upgrading to the latest stable versions, and testing with smaller datasets.

Have you had a chance to try these recommendations? If so, did they help reduce the timeouts or improve stability?

If the issue remains, please let us know so we can continue troubleshooting with you.
