robertozsr
Regular Visitor

High CU consumption from Fabric notebooks — optimization and architecture advice

Hello everyone,
I’m building a Lakehouse pipeline with a medallion architecture (bronze → silver → gold), where the gold layer feeds a semantic model for Power BI reports. I’ve observed very high Compute Unit (CU) usage from a few notebooks (screenshot attached), at one point reaching 50,000 CUs. That’s fine for testing, but not sustainable in production.


 Main questions:

  • Is heavy CU usage expected from notebooks because they run Spark transformations?

  • If optimizing the notebooks gives limited savings, is it better to move transformations to a data warehouse (SQL-first) to reduce consumption costs?

  • Would a hybrid design (Delta Lakehouse for ingestion and staging + DW for the medallion architecture) be a recommended pattern? What are the trade-offs (cost, performance, maintainability)?

  • Any practical tips: cluster sizing/auto-pause, caching, partitioning, query pushdown, monitoring & cost control?

I’m learning data engineering, so any guidance or references are welcome. Thanks!

1 ACCEPTED SOLUTION
v-tejrama
Community Support

Hi @robertozsr ,

 

The high CU usage you’re seeing is quite common when running Spark notebooks in Fabric, especially within a medallion-style Lakehouse setup. Spark is a distributed compute engine, and even relatively simple transformations can consume significant resources depending on data volume, partitioning, and how efficiently the code is written. If the workload is primarily reading large datasets, performing joins, or writing out multiple intermediate files, those operations can quickly drive up CU costs.

 

It’s worth taking a close look at the efficiency of your notebook logic and how the Spark cluster is behaving. Unnecessary recomputations, wide joins, or unoptimized partitions can cause Spark to process much more data than needed. Reviewing the Spark UI can help you pinpoint which stages are consuming the most time or resources. It’s also important to make sure each notebook explicitly stops the Spark session when it finishes. If the session isn’t closed, the cluster can remain active and continue consuming compute units even when there’s no work being done.
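
As a minimal sketch of that last point, the final cell of each notebook can release the session explicitly (assuming the Fabric notebook runtime, where the mssparkutils utilities are available; plain spark.stop() is the generic PySpark fallback):

# Final notebook cell: explicitly stop the Spark session so the pool
# does not keep accruing CU charges after the work is done.
try:
    from notebookutils import mssparkutils  # Fabric/Synapse notebook utilities (assumed available)
    mssparkutils.session.stop()
except ImportError:
    spark.stop()  # generic PySpark fallback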

In production, a solid approach is to rely on Spark primarily for ingestion and heavy transformations in your bronze and silver layers, where distributed compute provides real benefits.

 

Once your data is cleaned and structured, it’s often more efficient to move the gold-layer transformations into a Fabric Data Warehouse. The SQL-based compute in the warehouse is more predictable and cost-efficient for these curated, relational workloads, and it fits naturally with Power BI models. Many teams find that this hybrid approach, using the Lakehouse for data preparation and the Warehouse for serving analytics, offers the best balance between flexibility, performance, and cost.

 

For cost management, make sure your Spark pool is appropriately sized for the workload, and use auto-pause or session timeouts so clusters don’t sit idle. Caching and partitioning can help, but should be applied selectively to avoid unnecessary overhead. It’s also a good idea to monitor CU consumption through the Fabric Capacity Metrics app to identify long-running or idle sessions that may need optimization.
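
As a small illustration of applying caching selectively (table and column names here are hypothetical), cache only a DataFrame that feeds more than one output and release it as soon as the dependent writes are done:

# Cache selectively: only a DataFrame that is reused by several writes.
silver_df = spark.read.table("silver.sales")   # hypothetical silver table
silver_df.cache()                              # computed once, reused by both writes below

(silver_df.groupBy("order_date").sum("amount")
    .write.format("delta").mode("overwrite").saveAsTable("gold.daily_sales"))

(silver_df.groupBy("region").sum("amount")
    .write.format("delta").mode("overwrite").saveAsTable("gold.sales_by_region"))

silver_df.unpersist()                          # release the cache once both writes finish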

 

By refining your notebook logic, managing session lifecycles carefully, and adopting a hybrid Lakehouse and Warehouse design, you should see much more sustainable CU usage in production without compromising performance.

 

Best Regards,
Tejaswi.
Community Support


8 REPLIES
v-tejrama
Community Support


Hi @robertozsr ,


I wanted to follow up and see if you had a chance to review the information shared. If you have any further questions or need additional assistance, feel free to reach out.

 

Thank you.

I have a follow-up question:

For each layer schema (bronze, silver, gold) I created a corresponding notebook, and in each of those notebooks I perform the transformations, joins, table writes, and validations. But the notebooks are getting slower. So perhaps the better workflow would be to split the work by group of transformations rather than having one big notebook per layer?

And now that I think about it, something else that is consuming a lot of CU is all the validation steps I run against the Delta tables: whenever I create a Delta table, as good practice I run spark.sql queries to validate the data, but I now recognise that this consumes a lot too. How could I perform validation without using Spark? Through the SQL analytics endpoint?

Thanks a lot!

Hello, and sorry again for the late reply.

I’ve read your message carefully. First of all, thank you for the detailed feedback. I think the CU usage mainly comes down to two areas:

  1. My code logic and optimization.
    I need to use the Spark UI to identify which parts of the workflow are consuming resources and then refactor accordingly. The dataset I’m working with right now is quite small, around 28,975 rows (small subset for testing).

  2. Architecture.
    The architectural suggestion you gave makes a lot of sense. Having the bronze and silver layers in the Lakehouse, then feeding the silver data into the Warehouse, and finally building the gold layer and the semantic model for Power BI there seems like the right approach.

    I see the bronze and silver layers essentially as a staging area, leveraging Spark’s parallel processing. Then the warehouse serves as the more traditional environment to deliver clean, modeled data to BI users.

Regarding the data scientists: for their datasets, they can pull what they need directly from the silver layer, right? That way they can prepare and transform their datasets independently.

Thanks again for all the clarification and the time you took to explain everything!

lbendlin
Super User

Educate your developers that they have to terminate the Spark session as part of the notebook, as the last step.

Hello, thanks for the suggestion. But isn’t that the same as just stopping the notebook execution from the button at the top left? Or are you suggesting it as an extra safeguard? So when the notebook runs top to bottom, we enforce stopping the session once the notebook has finished executing, I guess.

nielsvdc
Continued Contributor

Hi @robertozsr, the CU usage of a notebook depends heavily on a number of things. Is your notebook code efficient? How large is your data? This is also why we can’t give you precise advice on your questions: we don’t know your situation.

 

If you are processing a relatively small amount of data, then probably your code is not efficient and is causing the CU usage. If you are processing a relatively large amount of data, your code may still not be efficient enough, but the amount of data contributes to a higher CU usage.

 

Can you tell us more about your situation and about the process you are doing that causes the high CU usage?


Hello, thanks for the questions; they gave me the right input to reflect: the dataset is fairly small, so I am pretty sure the issue lies in my code efficiency.

For example:

# Write into gold.table
df.write \
    .format("delta") \
    .mode("overwrite") \
    .option("overwriteSchema", "true") \
    .saveAsTable("gold.table")


I am pretty sure the overwrite mode consumes CU, because every run it needs to rewrite the whole table. I used this mode just to avoid overcomplicating things, but in production this will be an incremental load.
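
Something like the sketch below is what I have in mind for the incremental load, using the Delta Lake MERGE API (just a sketch; the "id" key column and the incoming batch df are placeholders for my real data):

from delta.tables import DeltaTable

# Incremental upsert instead of a full overwrite: only new or changed rows
# are written. Assumes gold.table already exists and that "id" uniquely
# identifies a row in the incoming batch df.
target = DeltaTable.forName(spark, "gold.table")

(target.alias("t")
    .merge(df.alias("s"), "t.id = s.id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())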

 

Regarding the .option("overwriteSchema", "true"), I do not know whether that increases CU usage.

 

 

I used window functions in my Spark SQL queries, which I guess consume CUs due to the continuous repartitioning (shuffles)?

 

I used several left joins.

 

So I think the issue lies primarily in my code logic and efficiency.

I will expand my investigation using the Spark UI.

 

Thanks for the feedback

 
