Solved: Notebook taking longer in pipeline, compared to ru...

arlindTrystar · 09-20-2024

Can someone explain this to me?

Why is the timing in the snapshot details different from the timing in the run details?

When I run the notebook on its own, it takes no more than 2-3 minutes. When it runs as part of a pipeline, it takes longer.

For context: this notebook is part of a pipeline that copies tables from the source to Azure and then triggers this notebook to load them into the silver lakehouse. Multiple tables are loaded, and after each one is loaded as raw, we trigger this notebook.

Anonymous · 09-23-2024

Hi @arlindTrystar,

Thanks for the reply from frithjof_v.

In Fabric, the notebook takes longer in the pipeline than on its own because the pipeline triggers it once after each table is loaded, so the durations of those sequential runs add up to a longer overall duration. Run on its own, the notebook executes only once.

The difference in timing between the snapshot details and the run details can be attributed to several factors:

There could be delays in allocating resources or initializing the environment, which are captured in the snapshot but not in the run details.

The snapshot might include additional overhead from logging and monitoring processes that are not accounted for in the run details.

Best Regards,
Yang
Community Support Team

If any post helps, please consider accepting it as the solution so that other members can find it more quickly.
If I have misunderstood your needs or you still have problems, please feel free to let us know. Thanks a lot!


frithjof_v · 09-23-2024

You can use a Master notebook in the pipeline to trigger child notebook runs.

This way, the notebook runs can share the same Spark session, meaning you won't have to wait for cluster start-up on each notebook run.

https://learn.microsoft.com/en-us/fabric/data-engineering/notebook-utilities#reference-a-notebook
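A minimal sketch of the Master notebook -> Child notebook pattern, in Python. The child notebook name, the table list, and the `table_name` parameter are hypothetical placeholders, not taken from the original pipeline. The runner is passed in as a function so the sketch also runs outside Fabric; inside a Fabric notebook you would pass `notebookutils.notebook.run` (which, as far as I know, takes the notebook name, a timeout in seconds, and a parameter map) instead of the stand-in.

```python
from typing import Callable, Dict, List

# Hypothetical placeholders: replace with your own child notebook and tables.
CHILD_NOTEBOOK = "load_to_silver"
TABLES = ["customers", "orders", "invoices"]


def run_children(
    tables: List[str],
    run_notebook: Callable[[str, int, Dict[str, str]], str],
    timeout_seconds: int = 600,
) -> Dict[str, str]:
    """Run the child notebook once per table, one after another,
    and collect each run's exit value keyed by table name."""
    results: Dict[str, str] = {}
    for table in tables:
        # Each call reuses the master notebook's session when the runner
        # is notebookutils.notebook.run inside Fabric.
        results[table] = run_notebook(
            CHILD_NOTEBOOK, timeout_seconds, {"table_name": table}
        )
    return results


def fake_run(name: str, timeout_seconds: int, arguments: Dict[str, str]) -> str:
    """Stand-in runner so the sketch can be executed outside Fabric."""
    return f"{name}:{arguments['table_name']}:ok"


# Inside Fabric: results = run_children(TABLES, notebookutils.notebook.run)
results = run_children(TABLES, fake_run)
print(results)
```

The pipeline then needs only one notebook activity (the master), rather than one notebook activity per table.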

There is also an approach called thread pooling, which is said to be even faster.
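A rough sketch of what thread pooling could look like from a master notebook, using Python's standard `concurrent.futures.ThreadPoolExecutor`. I have not benchmarked this; the child notebook name, table list, `table_name` parameter, and worker count are all hypothetical. As before, the runner is injected so the code runs anywhere; in Fabric you would pass `notebookutils.notebook.run`.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

# Hypothetical placeholders: replace with your own child notebook and tables.
CHILD_NOTEBOOK = "load_to_silver"
TABLES = ["customers", "orders", "invoices"]


def run_children_in_parallel(
    tables: List[str],
    run_notebook: Callable[[str, int, Dict[str, str]], str],
    max_workers: int = 4,
    timeout_seconds: int = 600,
) -> Dict[str, str]:
    """Submit one child notebook run per table to a thread pool
    and collect the exit values keyed by table name."""

    def run_one(table: str) -> str:
        return run_notebook(CHILD_NOTEBOOK, timeout_seconds, {"table_name": table})

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the input tables.
        exit_values = list(pool.map(run_one, tables))
    return dict(zip(tables, exit_values))


def fake_run(name: str, timeout_seconds: int, arguments: Dict[str, str]) -> str:
    """Stand-in runner so the sketch can be executed outside Fabric."""
    return f"{name}:{arguments['table_name']}:ok"


# Inside Fabric: run_children_in_parallel(TABLES, notebookutils.notebook.run)
parallel_results = run_children_in_parallel(TABLES, fake_run)
print(parallel_results)
```

Unlike the sequential loop, the child runs overlap here, so this only fits when the table loads are independent of each other.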

High concurrency in Data pipelines will probably be a no-code, out-of-the-box solution when it gets released. It is on the roadmap:

https://learn.microsoft.com/en-us/fabric/release-plan/data-engineering#investment-areas

However, for now I think the Master notebook -> Child notebook pattern is the available option.