Re: Capacity Metrics - CU Utilization Alerting

BIDanny · ‎10-23-2024

Good Morning PBI Community, wondering if anyone else is struggling with this. We are attempting to "Right Size" Fabric Capacity SKU based off current reporting needs for various projects. Each project is its own Fabric SKU ranging from F2 to F8 for the more load intensive reporting needs.

We are finding that in certain conditions that some reports are running away with the CU's available within the 1hr decay window, and sometimes we are receiving the "Capacity Limit Exceeded" notification when we attempt to run a Notebook or even present a report through Power BI Embedding in our application. This is troublesome as we have no idea that its coming close to hitting the limit until its too late.

Now the obvious answer is to spend more and bump up the capacity. However, its our opinion that this is reactive and we are seeking a more proactive approach - to increase the capacity before the limits are hit and we are in a position of being throttled.

I spoke with our Azure Digital Solution Architect and he said he'd poke around internally, but suggested I post here (probably should have came here first 😀 but I digress)

I proposed a feature that would allow us to create Alerts using Azure Metrics similar to how we would with a VM that is consuming too much CPU or is unavailable - seems to make sense to have all the alerts/monitoring in one place. Since that isn't currently available my next choice was to use the Fabric Capacity Metrics dataset. Paradoxically - the Capacity Metrics dataset uses CU in order to evaluate the CU's used by the capacity 😅 - and is still a reactive approach, since we have to refresh the dataset and its not exactly "current".

In review of the Admin - REST API (Power BI Power BI REST APIs) | Microsoft Learn - there isnt really anything in here that I've seen (maybe I overlooked) that provides CU utilization across the entire Capacity, just for Datasets/Dataflows.

Any help our guidance would be appreciated.

TIA!

Dan

BIDanny · ‎10-23-2024

https://youtu.be/wnrj2q_MILU?si=g6ebtdh9fkKftmfH

This was provided to me by our Azure guy, will follow up if this ends up working out!

lbendlin · ‎10-23-2024

Pat Mahoney's video is all very well but keep in mind that the Kusto store behind the Capacity Metrics App has a large delay of at least 8 minutes. Think of your capacities as mini suns. You won't know if they exploded until eight minutes after the fact. Real time it is not, and you will always be in reactive mode.

(if you want a fun exercise - connect to the semantic model of the metrics app, look at their data model and feast your eyes...)

BIDanny · ‎10-23-2024

Yep. Seems like the best I can do is see it getting close - and size up before it gets to the point of being throttled. Autoscaling would be so nice.

lbendlin · ‎10-23-2024

That's what we do

Keep an eye on the background trend. If you don't like what you see, identify the culprits and talk to them in a nice, non-threatening way.

lbendlin · ‎10-23-2024

Most of what you say is spot on. There are a lot more nuances to this - for example the CU cost is not only a function of the duration, but also the computational complexity. So even if you had better visibility of what is in flight (which so far is purely manual, for example via the monitoring hub or via REST API calls), you still won't know when exactly the brown stuff hits the spinny thing, and how big of a splat it will make.

I assume you have considered the scaling options Scale your Fabric capacity - Microsoft Fabric | Microsoft Learn and the auto size options. Evaluate and optimize your Microsoft Fabric capacity - Microsoft Fabric | Microsoft Learn

As you mentioned - at the end of the day it all boils down to cost. Define how much you are willing to pay and then get the biggest capacity that fits the envelope. F8 is, uhm, er, rather small.

BIDanny · ‎10-23-2024

Thank you for your reply. Yeah, I am aware that the CU cost isnt just the compute - its not so much what causes the Throttling, its finding out when its getting close so I can perform that manual scale up - since there is no auto scale option available.

I've resigned myself that I'll need to build some proactive monitoring logic (Probably a Runbook or ADF Pipeline) that will routinely query some REST service - but like I said I cant find anything that provides the granularity that we get when we run the Capacity Metrics App.

Some of the below may not be fully accurate but what I've gleaned is the following:

1. The Fabric SKU has a Maximum CU limit

2. CU Usage "Decays" after 1 hour from its use - i.e. If I have 10k CU available, and I have a report that uses 9k CU - I only have 1k left in my budget until that 9k falls off an hour later

3. Bursting is allowed with limitations, but sustained overage will result in the capacity limit exceeded (And a complete shutdown of all Interactive and Background services on that Fabric Capacity). Im not sure as of now how much Bursting is acceptable, is it 2 minutes? 5 minutes? less? (MSFT, I'd love to know hehe)

My goal is to obtain reliable calculation of this Usage vs Available CU - and if I have a product that is moving closer to reaching its max - we can send alerting to our Teams Group (I have a PowerAutomate flow th at does this) for someone to bump the SKU up before we hit that threshold. I know that I can get this data from the Capacity Metrics semantic model, but refreshing that every 15 minutes seems like a great way to blow through those available CU and still be able to miss a sudden sustained spike that I might otherwise have available if I ping every 5.

Open to creative solutions here, the world is our oyster here!

lbendlin · ‎10-23-2024

Your #2 is inaccurate.

If you exceed the CUs for a given 30 second slot then debt will be accrued for 10 minutes (you will be allowed to consume more CUs than you paid for). You can pay these debts back if your CU consumption falls below the limit. This is referred to as Burndown.

If you exceed the 10 minute overage then Interactive Throttling will be applied - queries will be delayed.

BUT - if you exceed the capacity for 60 minutes this is when the capacity loses its, uhm, cool. Interactive queries will be rejected, and if your refreshes keep completing then there is a high chance that your capacity will lock up hard for many hours.

Your main mitigation would be to move everything off that capacity onto another one. This assumes you have one lying around... Which is most likely not the case.

BIDanny · ‎10-23-2024

Actually we will absolutely have multiple SKU lying around.

Microsoft has even said my proposed architecture is a-typical but it kind of works like this.

We are embedding reports through a MVC style application. This application supports multi-tenancy.
For each customer, we are deploying a workspace with everything needed to produce their reports.
Not all customers are created equal, as some may have a couple users running a couple reports all day, some may have a dozen users firing off many reports an hour.
Maybe we can "fit" 5-6 customers on one F4 sku, where another customer needs a Dedicated F8 SKU for consistent operation.
The idea is that we can kind of cup-and-ball the workspaces around to spread the load as evenly as possible.. a sort of load balancing for Power BI.

Seems like it would be six-vs-one-half-dozen if we had four F2s vs one F8 - but what we're trying to avoid is for Customer C to run away with all the capacity and affect customers A B and D.

Capacity Metrics - CU Utilization Alerting

Helpful resources

Join us at the Microsoft Fabric Community Conference

Power BI Monthly Update - February 2025

Fabric Community Update - February 2025

Join us at the 2025 Microsoft Fabric Community Conference

Capacity Metrics - CU Utilization Alerting

Helpful resources

Join us at the Microsoft Fabric Community Conference

Power BI Monthly Update - February 2025

Fabric Community Update - February 2025