Hello everyone,
I’m building a Lakehouse pipeline with a medallion architecture (bronze → silver → gold). The gold layer feeds a semantic model for Power BI reports. I’ve observed very high Compute Unit (CU) usage from a few notebooks (screenshot attached); it even reached 50,000 CUs. That’s fine for testing, but not sustainable in production.
Main questions:
Is heavy CU usage expected from notebooks because they run Spark transformations?
If optimizing the notebooks gives limited savings, is it better to move transformations to a Data Warehouse (SQL-first) to reduce consumption costs?
Would a hybrid design (Delta Lakehouse for ingestion and staging + DW for the medallion architecture) be a recommended pattern? What are the trade-offs (cost, performance, maintainability)?
Any practical tips: cluster sizing/auto-pause, caching, partitioning, query pushdown, monitoring & cost control?
I’m learning data engineering, so any guidance or references are welcome. Thanks!
Hi @robertozsr,
The high CU usage you’re seeing is quite common when running Spark notebooks in Fabric, especially within a medallion-style Lakehouse setup. Spark is a distributed compute engine, and even relatively simple transformations can consume significant resources depending on data volume, partitioning, and how efficiently the code is written. If the workload is primarily reading large datasets, performing joins, or writing out multiple intermediate files, those operations can quickly drive up CU costs.
It’s worth taking a close look at the efficiency of your notebook logic and how the Spark cluster is behaving. Unnecessary recomputations, wide joins, or unoptimized partitions can cause Spark to process much more data than needed. Reviewing the Spark UI can help you pinpoint which stages are consuming the most time or resources. It’s also important to make sure each notebook explicitly stops the Spark session when it finishes. If the session isn’t closed, the cluster can remain active and continue consuming compute units even when there’s no work being done.
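To make that concrete, here is a minimal PySpark sketch of two of the patterns mentioned above: broadcasting a small dimension table to avoid a wide shuffle join, and repartitioning before the write so the output isn’t fragmented into many small files. The table and column names (bronze_orders, bronze_customers, customer_id, etc.) are placeholders, not something from your pipeline:

```python
# Minimal PySpark sketch (hypothetical table/column names) of two patterns
# mentioned above: broadcast the small side of a join, and control the number
# of output files before writing.
from pyspark.sql import functions as F

orders = spark.read.table("bronze_orders")        # large fact table (assumed)
customers = spark.read.table("bronze_customers")  # small dimension (assumed)

# Broadcasting the small dimension lets Spark skip shuffling the large table.
enriched = orders.join(F.broadcast(customers), on="customer_id", how="left")

# Do the cleanup in one pass instead of writing intermediate copies to storage.
silver = (
    enriched
    .filter(F.col("order_status") == "COMPLETED")
    .withColumn("order_date", F.to_date("order_ts"))
)

# Repartition before the write so the silver table isn't made of thousands of tiny files.
silver.repartition(16).write.mode("overwrite").saveAsTable("silver_orders")
```

The Spark UI’s SQL/stages view will show you whether a join is actually being broadcast and how much data each stage shuffles, which is usually the quickest way to spot the expensive steps.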
In production, a solid approach is to rely on Spark primarily for ingestion and heavy transformations in your bronze and silver layers, where distributed compute provides real benefits.
Once your data is cleaned and structured, it’s often more efficient to move the gold-layer transformations into a Fabric Data Warehouse. The SQL-based compute in the Warehouse is more predictable and cost-efficient for these curated, relational workloads, and it fits naturally with Power BI models. Many teams find that this hybrid approach, using the Lakehouse for data preparation and the Warehouse for serving analytics, offers the best balance of flexibility, performance, and cost.
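As an illustration of that split, the sketch below expresses a gold-layer aggregation as plain SQL over an assumed silver table. In a notebook it runs through spark.sql(); the same SELECT logic could instead be created as a view or table in a Fabric Warehouse so the serving layer runs on SQL compute rather than Spark. Table and column names are hypothetical:

```python
# Hedged sketch: a gold-layer aggregation written as SQL over an assumed
# silver table. Here it runs on Spark via spark.sql(); the same SELECT could
# live as a Warehouse view instead, which is the "serve from the Warehouse"
# half of the hybrid pattern described above.
gold_daily_sales = spark.sql("""
    SELECT order_date,
           customer_region,
           SUM(order_amount) AS total_sales,
           COUNT(*)          AS order_count
    FROM silver_orders
    GROUP BY order_date, customer_region
""")

gold_daily_sales.write.mode("overwrite").saveAsTable("gold_daily_sales")
```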
For cost management, make sure your Spark pool is appropriately sized for the workload, and use auto-pause or session timeouts so clusters don’t sit idle. Caching and partitioning can help, but should be applied selectively to avoid unnecessary overhead. It’s also a good idea to monitor CU consumption through the Fabric Capacity Metrics app to identify long-running or idle sessions that may need optimization.
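On the caching and partitioning point, a small hedged sketch (hypothetical table and column names): cache only a DataFrame that is reused several times within the same notebook, release it when you’re done, and partition large tables on a column your queries filter by so downstream reads scan less data.

```python
# Selective caching: cache a DataFrame only because it feeds two downstream
# aggregations in this notebook, then release the memory when finished.
events = spark.read.table("silver_events")  # assumed table name

frequent = events.filter("event_type = 'purchase'").cache()

daily = frequent.groupBy("event_date").count()
by_user = frequent.groupBy("user_id").count()

daily.write.mode("overwrite").saveAsTable("gold_daily_purchases")
by_user.write.mode("overwrite").saveAsTable("gold_purchases_by_user")

frequent.unpersist()

# Partition a large table on a column queries commonly filter by (e.g. a date),
# so reads only touch the relevant partitions.
events.write.mode("overwrite").partitionBy("event_date").saveAsTable("silver_events_partitioned")
```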
By refining your notebook logic, managing session lifecycles carefully, and adopting a hybrid Lakehouse and Warehouse design, you should see much more sustainable CU usage in production without compromising performance.
Best Regards,
Tejaswi.
Community Support
Educate your developers to terminate the Spark session as the last step of every notebook.
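For example, a final cleanup cell might look like the sketch below. mssparkutils is the notebook utility object normally available in Fabric/Synapse notebooks, but the exact utility name can vary by runtime, so treat this as an assumption rather than the only way to do it:

```python
# End-of-notebook cleanup (hedged sketch): stop the interactive Spark session
# so the cluster doesn't keep consuming capacity after the work is done.
try:
    mssparkutils.session.stop()   # Fabric/Synapse notebook utility (assumed available)
except NameError:
    spark.stop()                  # plain Spark fallback if the utility isn't defined
```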
Hi @robertozsr, the CU usage of a notebook depends heavily on several factors. Is your notebook code efficient? How large is your data? This is also why we can’t give you precise advice on your questions: we don’t know your situation.
If you are processing a relatively small amount of data, inefficient code is probably what’s driving the CU usage. If you are processing a relatively large amount of data, the code may still not be efficient enough, but the data volume itself also contributes to higher CU usage.
Can you tell us more about your situation and about the process you are doing that causes the high CU usage?