dbeavon3
Continued Contributor

How to Get Value out of *all* 24 Hours of Capacity (CU in purple box)

The Power BI team keeps changing what is meant by P1 capacity. 

 

In the past there were distinct cores for background and interactive operations. It was easier for the layman to understand what they were paying for and to maximize the bang for the buck.

 

Nowadays there are concepts like "carry forward" and "smoothing" that allow us to use MORE (or less) CU than we are owed at any given time.  In theory these concepts are supposed to make a capacity easier to manage.  In practice I have NOT found that to be the case.  For example, yesterday a PBI developer consumed a massive amount of CU all at once.  Even after he stopped what he was doing, it took *** 2 HOURS *** for the throttling on our P1 to clear and for our production workloads to start working again!  This is totally insane.  It is very hard for me to believe that Microsoft is making these types of PBI changes in the interest of the customers.  In practice, the only thing it accomplishes is convincing decision-makers that we need to spend more money on Power BI capacity, since buying more capacity is the only solution Microsoft promotes for avoiding these accidental episodes.
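
To see why the throttling drags on so long, here is some back-of-napkin arithmetic.  This is my own toy model of the burn-down, NOT Microsoft's published algorithm (the real policy has overage-protection windows and thresholds), but the basic shape is debt divided by earn-rate:

# Toy model: why one CU burst throttles a P1 for hours afterwards.
# Assumption (mine): a P1 is equivalent to an F64, earning 64 CU-seconds
# of capacity every second, around the clock.
P1_CU_PER_SECOND = 64

# Suppose a developer burns two hours' worth of the capacity's CU in one shot.
burst_cu = P1_CU_PER_SECOND * 3600 * 2

# Carry-forward means the burst was borrowed from FUTURE time slices, so the
# debt gets paid back out of the CU the capacity earns going forward.
seconds_to_recover = burst_cu / P1_CU_PER_SECOND
print(f"Throttled for ~{seconds_to_recover / 3600:.0f} hours")  # ~2 hours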

 

Recently these episodes are happening a LOT more frequently than they did in the past.  One of the biggest changes made to the P1 is that the accounting for background and interactive operations is now intermixed.  Previously, background operations could be scheduled as frequently as needed without impacting the CU available to "interactive" users.  Background operations could either be spread across the day or concentrated at night.  Some types of background operations happen hourly, and some larger ones happen once a day.  A background operation only needed to account for the capacity used by OTHER background operations; in no case was there any impact on CU for "interactive" users.

 

Now things are different.

 

Now the CU for background operations accrues against the exact same CU that is used by "interactive" operations.  The CU pool becomes fungible, and background operations steal from "interactive" operations, with no way for us to manage them separately.  Worse yet, background CU is distributed evenly across the entire day: a background refresh that consumes roughly 86,400 CU-seconds gets smoothed into about 1 CU-second of extra ambient load every second, all day long.  See the image below, where the blue bar at the bottom represents the background operations distributed across the day.  As a result of this ambient CU usage, we will often "over-utilize" our capacity during interactive operations, and that results in throttling.

... continued..

 

[Image: dbeavon3_1-1715172005137.png — 24-hour CU utilization chart; the blue bar along the bottom is the smoothed background load, and the purple box is the unused overnight CU.]

 

But the most unfortunate thing is the big purple box you see above, which represents CUs that we paid for but which are INACCESSIBLE to us.  Even if we wanted to schedule all our background operations during the overnight hours, it wouldn't help, because smoothing borrows those CUs from future capacity instead!  We have no way of fully spending the CUs that would otherwise be available to us overnight.  It could represent up to 20% or 30% of the value that we are paying for.
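
Here is the same kind of napkin math on how much value the purple box strands.  Every number below is a made-up assumption (flagged in the comments), but with plausible inputs it lands right in that 20-30% range:

# Rough estimate of the "purple box": CU we pay for but cannot deliberately spend.
# Assumptions (mine, not Microsoft's numbers): P1 ~ F64 = 64 CU per second,
# earned 24/7; interactive users are online ~12 hours a day; smoothing pins
# the ambient background load at a flat, hypothetical 30 CU per second.
CAPACITY = 64          # CU per second, earned around the clock
AMBIENT = 30           # hypothetical smoothed background load, CU per second
BUSINESS_HOURS = 12    # hours per day with interactive users online

paid = CAPACITY * 24                                     # CU-hours paid per day
stranded = (CAPACITY - AMBIENT) * (24 - BUSINESS_HOURS)  # idle overnight CU-hours

print(f"Stranded overnight: {stranded / paid:.0%} of what we pay for")  # ~27%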

 

Can anyone share a strategy for getting access to the CUs in the purple box?

 

8 REPLIES
lbendlin
Super User

We have lodged multiple complaints about that stuff with Microsoft. 

- We have multiple capacities yet are unable to pay our CU debt in one capacity with "free" CUs in the other capacities.

- There is zero information about in-flight issues.  You only see the problem when it is too late. 

- Microsoft is refusing to add Gen 1 dataflows to the Monitoring hub. Gen 1 dataflows are one of the biggest risks, as they have no "natural" timeouts. A single dataflow finishing after 18 hrs (successful or not) can easily ice up your capacity for six hours.  We have now resorted to forcefully cancelling all dataflows that run longer than 5 hrs (the semantic model timeout limit). I really wish we didn't need to do that.
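
For anyone who wants to build the same watchdog, here is roughly what ours boils down to.  A sketch only: the Get Dataflow Transactions and Cancel Dataflow Transaction endpoints are real Power BI REST APIs, but treat the field names, the "InProgress" status literal, and the auth plumbing as assumptions to verify against the docs.

from datetime import datetime, timedelta, timezone
import requests

API = "https://api.powerbi.com/v1.0/myorg"
LIMIT = timedelta(hours=5)  # match the semantic-model refresh timeout

def cancel_long_running_dataflows(group_id: str, token: str) -> None:
    headers = {"Authorization": f"Bearer {token}"}  # token acquisition elided

    # List the workspace's dataflows (Dataflows - Get Dataflows).
    dataflows = requests.get(f"{API}/groups/{group_id}/dataflows",
                             headers=headers).json().get("value", [])

    for df in dataflows:
        # List refresh transactions (Dataflows - Get Dataflow Transactions).
        txns = requests.get(
            f"{API}/groups/{group_id}/dataflows/{df['objectId']}/transactions",
            headers=headers).json().get("value", [])

        for txn in txns:
            if txn.get("status") != "InProgress":  # status literal: verify
                continue
            # Timestamp parsing may need tweaking for fractional seconds.
            started = datetime.fromisoformat(
                txn["startTime"].replace("Z", "+00:00"))
            if datetime.now(timezone.utc) - started > LIMIT:
                # Dataflows - Cancel Dataflow Transaction.
                requests.post(
                    f"{API}/groups/{group_id}/dataflows/transactions/{txn['id']}/cancel",
                    headers=headers)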

 

Microsoft's aim is clear. Get rid of contractual pricing and move everyone to consumption pricing. Needless to say, that does not match our goals at all.

dbeavon3
Continued Contributor
Continued Contributor

So what would a customer do to gain access to the CUs living in the purple box?  Maybe we share a capacity with another customer in China, whose business hours would fall in our overnight window?

 

I've heard stories of companies who buy contractual services from Tableau, and then resell them to a set of downstream customers. 

 

I'm sure Microsoft would discourage that...  but at least they should see that it is unfair for the overnight CUs to go to waste.  We should be able to spend those CUs on overnight background operations.

 

 

"Overnight"  is not a concept in our company as we operate 24/7. But we do have varying load across capacities. Would be nice to be able to treat them all as one pool (without the nightmare of having to combine them all into a P5 or worse)

dbeavon3
Continued Contributor

I understand that different teams want isolated resources if it prevents them from being impacted by outages.

 

At a high level, what is the nightmare that happens, as you increase one monolithic capacity from a P1 up to P5?

 

My plan is to have a distinct capacity for certain predictable workloads (kiosks and operational reports).  It would remain a P1, on the basis that its "interactive" operations will remain consistent and predictable.  This capacity would be called the "high priority" capacity.

 

For all other workloads (ad-hoc, development, and analytical reports), we would start dumping them into another monolithic capacity that would slowly grow over time to a P2, P3, etc.  It would be a "low priority" capacity, aka the "doghouse" capacity, and the I.T. department would support it on a limited, best-effort basis.  Most of the problems would be things that the users of the capacity would have to sort out amongst themselves.

 

I'm not sure if this solves everyone's problems, but it ensures that kiosks run smoothly and executives don't call I.T. to complain that their mission-critical refresh operation got frozen for two consecutive hours.

lbendlin
Super User

Yes, issue isolation is the biggest consideration.  Our "doghouse" capacity is basically a hall of shame for all workloads that have previously caused other capacities to crash/ice up and are now condemned to eternal suffering (well, until they improve their performance).  We also have the "high priority" capacity, which is tightly restricted to specific data subjects and their audiences.

dbeavon3
Continued Contributor

@lbendlin 

Is your doghouse capacity still contractual?  Does it use autoscale? Is it very big?  A P5?  Is it shared by multiple departments?  Who ends up paying for it?  It seems to me that having one giant doghouse makes sense, since the workloads are unpredictable and impossible to manage.  It would be very costly to have lots of smaller P1 doghouses that are only partly used.  The only problem with having one giant, badly managed doghouse is trying to find someone to pay for it.

 



I really hate the fact that managing Power BI involves so many non-technical considerations.  You have to have a lot of "soft skills" for this type of thing.  Since there is such a wide range of skill amongst the developers, it is very hard to give everyone special attention.  You eventually give up and focus on the non-technical side of things, like how to get people to pay for their CUs and how to agree on isolating capacities from each other.  Maybe it is best to let Microsoft convert the doghouse capacities to consumption-based pricing after all...  so that I.T. doesn't have to deal with the finger-pointing between departments and the question of who is going to pay the bill.

 

 

 

lbendlin
Super User

Why would I give the offenders a big platform? The doghouse is a P2.

 

We are not yet in the cross-charge business; most capacities are centrally managed.  But we're getting there whether we want it or not.

 

Speaking of soft skills - any form of limiting user activities may be technically correct, but it is the opposite of what we ultimately want to achieve: using Power BI to gain insights and derive actions. I strongly dislike having to have a doghouse in the first place.

dbeavon3
Continued Contributor

A P2 is still big in my book.  Over $100K.  Someone has to pay for that.  And that is when the finger-pointing starts...
