Hello - I'm working on building a Fabric Pipeline that uses multiple PySpark notebooks within its flow. I'm noticing that, although the notebooks run pretty quickly on their own, they take at least a minute longer when run within the pipeline. My assumption is that this is due to having to start up a new Spark session for each notebook invocation. Could someone confirm this for me? If that is the case, is there any way to maintain the session for the entire pipeline to avoid these extended start-up times? (Note: I don't believe the %run magic will work here, because I need to parameterize each notebook dynamically along the way, but please correct me if I'm wrong.) Thanks!
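For what it's worth, an interim workaround some people use is to orchestrate the child notebooks from a single driver notebook with a reference run, which executes the child in the calling notebook's Spark session rather than starting a new one, and accepts a parameter dictionary. This is only a sketch: the notebook name and parameter names below are hypothetical, and `mssparkutils` is only available inside the Fabric notebook runtime.

```python
def build_params(table, run_date):
    # Hypothetical parameter names; the child notebook would declare
    # matching variables in its parameter cell.
    return {"table_name": table, "run_date": run_date}

def run_child(notebook_name, params, timeout_seconds=600):
    # Deferred import: mssparkutils only exists inside a Fabric
    # notebook session. notebook.run executes the referenced notebook
    # in the current Spark session, avoiding a fresh session start-up.
    from notebookutils import mssparkutils
    return mssparkutils.notebook.run(notebook_name, timeout_seconds, params)

# Inside a driver notebook, each step could then be parameterized dynamically:
# run_child("LoadSales", build_params("sales", "2024-01-01"))
```

The trade-off is that the driver notebook, not the pipeline, owns the control flow, so you lose the pipeline's retry and dependency visuals.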
Hi @gmangiante ,
You are right: each notebook step starts a new Spark session.
We plan to enable session sharing across pipeline steps (high concurrency for pipelines), which would let you reuse sessions and avoid the additional start-up delay.
Deployment is targeted for this semester; the feature is currently in the design phase. Stay tuned for more updates.
Appreciate your patience.
Hope this helps. Please let us know if you have any further questions.
For anyone finding this thread later: this is scheduled for Q2 2024 (https://learn.microsoft.com/en-us/fabric/release-plan/data-engineering#concurrency)
Estimated release timeline: Q2 2024
In addition to high concurrency in notebooks, we will also enable high concurrency in pipelines. This capability will allow you to run multiple notebooks in a pipeline with a single session.
Hi,
This feature would be very helpful. Is there an update on the ETA? Or is it still expected in December?
Thanks - this totally makes sense, and I was guessing that it was on the roadmap, looking at the current high-concurrency capability for interactive notebooks - that would naturally extend to pipelines, and I'm sure I'm not the only person who's come up with this issue. I look forward to future developments, and I appreciate the quick response!
We're also experiencing quite a few performance issues with pipelines and are hoping that high concurrency will help in our case as well.
Not sure if this is helpful for you, but for now we've decided to go with pure Spark job definitions rather than Pipelines. It's not as modular or transparent as a Pipeline would be, but it gets the job done efficiently. Job definitions run in batch mode rather than interactive (https://learn.microsoft.com/en-us/fabric/data-engineering/spark-job-concurrency-and-queueing), which makes scheduling refreshes easier, since batch jobs are queued. We'd still like to get back to a notebook-powered Pipeline at some point, but having managed Spark available alongside our lakehouse is extremely useful.
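For anyone taking the same route, a minimal sketch of an entry script for a Spark Job Definition might look like the following. The parameter names (`--table`, `--run-date`) and the table logic are hypothetical; the point is that a job definition receives its parameters as command-line arguments, so the script parses them itself instead of relying on a notebook parameter cell.

```python
import argparse

def parse_args(argv=None):
    # A Spark Job Definition passes parameters as command-line
    # arguments, so the entry script parses them with argparse.
    parser = argparse.ArgumentParser(description="Batch lakehouse refresh")
    parser.add_argument("--table", required=True,
                        help="Target lakehouse table (hypothetical parameter)")
    parser.add_argument("--run-date", required=True,
                        help="Logical date for this refresh (hypothetical parameter)")
    return parser.parse_args(argv)

def main():
    args = parse_args()
    # Create (or reuse) the Spark session only when the job actually runs.
    from pyspark.sql import SparkSession
    spark = SparkSession.builder.getOrCreate()
    df = spark.read.table(args.table)
    # ... transformation logic for the given run date would go here ...
    df.write.mode("overwrite").saveAsTable(f"{args.table}_refreshed")

if __name__ == "__main__":
    main()
```

Scheduling the job definition then just means supplying different argument values per run, and the batch queueing linked above handles concurrency.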
To help prioritize this work item, please submit a new idea for Fabric Pipelines if one doesn't already exist.
Hi @gmangiante ,
Thanks for using Fabric Community and reporting this.
Apologies for the issue you have been facing. May I check whether you are still seeing it?
It's difficult to tell what the reason for this performance might be.
I have reached out to the internal team for help on this and will update you once I hear back from them.
Appreciate your patience.