Solved: How to optimize the CU pipeline?

anlebonny · 04-07-2025

Hello,

Our project’s data architecture includes multiple notebooks created in Fabric, one for each generated table. I have grouped the notebooks by domain and created one pipeline per domain. I then created a main pipeline that invokes these domain pipelines and also runs additional notebooks and semantic model refreshes.

This main pipeline is scheduled to run every hour to deliver fresh data to users. However, it consumes so much capacity that it is not sustainable. Unfortunately, I cannot change the data architecture.

I’ve noticed that each notebook takes at least 7 minutes to launch because a new session is created each time. I am using High Concurrency mode and have set session tags on the notebooks within the domain pipelines so that the same session is reused. This has improved speed somewhat, but I’m not sure what else I can do to achieve better performance. Would it help to apply the same session tag names used in the invoked domain pipelines to the notebooks that run directly in the main pipeline?

I’m not sure if I’m taking the right approach. I would appreciate advice on what else I can do to improve performance.

nilendraFabric · 04-07-2025

Hi @anlebonny

Here is a fantastic blog post on this:

https://justb.dk/blog/2025/03/optimizing-spark-with-high-concurrency-mode

Please check this thread as well:

https://community.fabric.microsoft.com/t5/Fabric-platform/Optimizing-CU-Usage-in-Microsoft-Fabric/m-...


v-ssriganesh · 04-18-2025

Hi @anlebonny,

May I ask if you have resolved this issue? If so, please mark the helpful reply and accept it as the solution. This will be helpful for other community members who have similar problems to solve it faster.

Thank you.

v-ssriganesh · 04-15-2025

Hi @anlebonny,
I wanted to check if you had the opportunity to review the information provided. Please feel free to contact us if you have any further questions. If my response has addressed your query, please accept it as a solution and give a 'Kudos' so other members can easily find it.
Thank you.

v-ssriganesh · 04-12-2025

Hi @anlebonny,

Thank you for posting your query in the Microsoft Fabric Community Forum, and thanks to @andrewsommer & @nilendraFabric for sharing valuable insights.

Could you please confirm if your query has been resolved by the provided solution? If so, please mark it as the solution. This will help other community members solve similar problems faster.

Thank you.

andrewsommer · 04-07-2025

Using consistent session tags can improve performance by ensuring session reuse, which reduces startup overhead. This minimizes the number of new compute instances being provisioned, which can be a cause of the capacity spike. However, tagging alone may not do what you need.
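To get a feel for how much startup overhead session reuse might remove, here is a rough back-of-the-envelope sketch. The 7-minute startup figure comes from the question; the notebook counts per domain and the function name `startup_overhead_minutes` are made up for illustration, and the model assumes every notebook in a domain can attach to one shared session, which may not hold if a session reaches its notebook limit.

```python
def startup_overhead_minutes(notebooks_per_domain, startup_min=7.0, shared_session=False):
    """Estimate total session-startup minutes for one pipeline run.

    notebooks_per_domain: notebook counts, one entry per domain pipeline
        (hypothetical values, not taken from the original post).
    shared_session: if True, assume all notebooks in a domain attach to one
        already-started session, so each domain pays the startup cost once.
    """
    if shared_session:
        sessions = len(notebooks_per_domain)   # one session per domain
    else:
        sessions = sum(notebooks_per_domain)   # one session per notebook
    return sessions * startup_min


# Hypothetical layout: three domains with 5, 3 and 4 notebooks.
domains = [5, 3, 4]
without_reuse = startup_overhead_minutes(domains)
with_reuse = startup_overhead_minutes(domains, shared_session=True)
print(f"startup minutes per run: {without_reuse} without reuse, {with_reuse} with reuse")
```

These are summed startup minutes, not wall-clock time, so they only indicate the order of magnitude of compute spent on session startup for each hourly run.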

Are you reprocessing every table in full on each run, or only the data that changed? Explore partitioning tables and using Delta Lake's MERGE with filtered conditions to limit processing.
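A minimal sketch of a filtered MERGE in a notebook could look like the following. It assumes the target table is partitioned by a date column and that the incoming batch contains only rows inside the same date window (otherwise keys that exist in older partitions would be inserted again as duplicates). The helper names `build_merge_condition` and `incremental_merge`, and the table and column names in the usage comment, are hypothetical; the Delta calls are from the `delta.tables` Python API.

```python
from datetime import date


def build_merge_condition(key_columns, partition_column, since):
    """Match on the business keys and restrict the target to recent partitions."""
    key_match = " AND ".join(f"t.{c} = s.{c}" for c in key_columns)
    # The extra predicate on the target's partition column lets Delta skip
    # files outside the window instead of scanning the whole table.
    return f"{key_match} AND t.{partition_column} >= '{since.isoformat()}'"


def incremental_merge(spark, target_table, changes_df, key_columns,
                      partition_column, since):
    """Upsert changes_df into target_table, touching only recent partitions.

    changes_df must already be filtered to rows with
    partition_column >= since (assumption of this sketch).
    """
    # Imported here so the condition helper stays usable without Spark.
    from delta.tables import DeltaTable

    condition = build_merge_condition(key_columns, partition_column, since)
    (
        DeltaTable.forName(spark, target_table).alias("t")
        .merge(changes_df.alias("s"), condition)
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute()
    )


# Example usage inside a notebook (hypothetical table and column names):
# incremental_merge(spark, "sales_orders", changes_df,
#                   ["order_id"], "order_date", date(2025, 4, 1))
print(build_merge_condition(["order_id"], "order_date", date(2025, 4, 1)))
```

With an hourly schedule, a window of a day or two is often enough to cover late-arriving rows, but the right window depends on how the source data behaves.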

Also check whether running the domain pipelines in parallel within the parent pipeline causes resource contention; if so, stagger their execution.
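In a pipeline, staggering would normally be done by chaining the invoke activities with dependencies rather than in code. Purely to illustrate the idea, here is a small Python sketch that caps how many jobs run at once and offsets their start times. The function `run_staggered` and the domain names are hypothetical, and the lambdas stand in for whatever actually triggers a domain's work.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_staggered(tasks, max_parallel=2, start_delay_s=0.0):
    """Run callables with bounded parallelism, spacing out their submissions.

    tasks: mapping of name -> zero-argument callable.
    Returns a dict of name -> result, in the order the tasks were given.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        futures = {}
        for name, task in tasks.items():
            futures[name] = pool.submit(task)
            time.sleep(start_delay_s)  # offset so starts do not coincide
        for name, future in futures.items():
            results[name] = future.result()  # re-raises if the task failed
    return results


# Placeholder jobs standing in for domain workloads.
jobs = {
    "sales": lambda: "sales done",
    "finance": lambda: "finance done",
    "hr": lambda: "hr done",
}
print(run_staggered(jobs, max_parallel=2))
```

Lowering `max_parallel` trades a longer total run for a lower peak load, which is usually the relevant trade-off when capacity is the bottleneck.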

Please mark this post as solution if it helps you. Appreciate Kudos.