Hello - I'm working on building a Fabric Pipeline that uses multiple PySpark notebooks within its flow. I'm noticing that, although the notebooks run pretty quickly on their own, they take at least a minute longer when run within the pipeline. My assumption is that this is due to having to start up a new Spark session for each notebook invocation. Could someone confirm this for me? If that is the case, is there any way to maintain the session for the entire pipeline to avoid these extended start-up times? (Note: I don't believe the %run magic will work here, because I need to parameterize each notebook dynamically along the way, but please correct me if I'm wrong.) Thanks!
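For what it's worth, an interim workaround some people use is to orchestrate the child notebooks from a single driver notebook with a reference run, which executes the child in the calling notebook's Spark session rather than starting a new one, and accepts a parameter dictionary. This is only a sketch: the notebook name and parameter names below are hypothetical, and `mssparkutils` is only available inside the Fabric notebook runtime.

```python
def build_params(table, run_date):
    # Hypothetical parameter names; the child notebook would declare
    # matching variables in its parameter cell.
    return {"table_name": table, "run_date": run_date}

def run_child(notebook_name, params, timeout_seconds=600):
    # Deferred import: mssparkutils only exists inside a Fabric
    # notebook session. notebook.run executes the referenced notebook
    # in the current Spark session, avoiding a fresh session start-up.
    from notebookutils import mssparkutils
    return mssparkutils.notebook.run(notebook_name, timeout_seconds, params)

# Inside a driver notebook, each step could then be parameterized dynamically:
# run_child("LoadSales", build_params("sales", "2024-01-01"))
```

The trade-off is that the driver notebook, not the pipeline, owns the control flow, so you lose the pipeline's retry and dependency visuals.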
Hi @gmangiante ,
You are right: each notebook step starts a new Spark session.
We plan to enable session sharing across pipeline steps (high concurrency for pipelines), which would let you reuse sessions and avoid the additional start-up delay.
Deployment is targeted for this semester; the feature is currently in the design phase. Stay tuned for more updates.
Appreciate your patience.
Hope this helps. Please let us know if you have any further questions.
For anyone finding this thread later: this is scheduled for Q2 2024 (https://learn.microsoft.com/en-us/fabric/release-plan/data-engineering#concurrency)
Estimated release timeline: Q2 2024
In addition to high concurrency in notebooks, we will also enable high concurrency in pipelines. This capability will allow you to run multiple notebooks in a pipeline with a single session.
Hi,
This feature would be very helpful. Is there an update on the ETA? Or is it still expected in December?
Thanks - this totally makes sense, and I was guessing that it was on the roadmap, looking at the current high-concurrency capability for interactive notebooks - that would naturally extend to pipelines, and I'm sure I'm not the only person who's come up with this issue. I look forward to future developments, and I appreciate the quick response!
We're also experiencing quite a few performance issues with pipelines and are hoping that high concurrency will help in our case as well.
Not sure if this is helpful for you, but for now we've decided to go with pure Spark job definitions rather than Pipelines. It's not as modular or transparent as a Pipeline would be, but it gets the job done efficiently. Job definitions run in batch mode rather than interactive (https://learn.microsoft.com/en-us/fabric/data-engineering/spark-job-concurrency-and-queueing), which makes scheduling refreshes easier, since batch jobs are queued. We'd still like to get back to a notebook-powered Pipeline at some point, but having managed Spark available alongside our lakehouse is extremely useful.
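For anyone taking the same route, a minimal sketch of an entry script for a Spark Job Definition might look like the following. The parameter names (`--table`, `--run-date`) and the table logic are hypothetical; the point is that a job definition receives its parameters as command-line arguments, so the script parses them itself instead of relying on a notebook parameter cell.

```python
import argparse

def parse_args(argv=None):
    # A Spark Job Definition passes parameters as command-line
    # arguments, so the entry script parses them with argparse.
    parser = argparse.ArgumentParser(description="Batch lakehouse refresh")
    parser.add_argument("--table", required=True,
                        help="Target lakehouse table (hypothetical parameter)")
    parser.add_argument("--run-date", required=True,
                        help="Logical date for this refresh (hypothetical parameter)")
    return parser.parse_args(argv)

def main():
    args = parse_args()
    # Create (or reuse) the Spark session only when the job actually runs.
    from pyspark.sql import SparkSession
    spark = SparkSession.builder.getOrCreate()
    df = spark.read.table(args.table)
    # ... transformation logic for the given run date would go here ...
    df.write.mode("overwrite").saveAsTable(f"{args.table}_refreshed")

if __name__ == "__main__":
    main()
```

Scheduling the job definition then just means supplying different argument values per run, and the batch queueing linked above handles concurrency.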
To help prioritize this work item, please submit a new idea for Fabric Pipelines if one doesn't already exist.
Hi @gmangiante ,
Thanks for using Fabric Community and reporting this.
Apologies for the issue you have been facing. May I check whether you are still seeing it?
It's difficult to tell what the reason for this performance might be.
I have reached out to the internal team for help on this and will update you once I hear back from them.
Appreciate your patience.