dbeavon3
Memorable Member

Cost of a Bad Query with Autoscale Enabled in a P1 (... is often $300 or more)

We have a P1 with autoscale enabled.

 

A couple of times a week the capacity will autoscale, and we will incur a large fee for the next twenty-four hours (e.g. ~$300, but it can be as high as $500).

 

Typically it is caused by a low-code developer who is in the process of building a report using premium features.  They may accidentally run a query while forgetting to include some critical filter (the fiscal year, the salesperson, or whatever), or they may cross-join large tables that have no relationships between them.

 

Here is one example of that happening last week.

...  A user ran a single bad query on Dec 25, and it resulted in elevated autoscale for 24 hours (at 160%).


 

 

 

One of the main reasons autoscale is so expensive is that the elevated capacity doesn't scale back down until 24 hours have passed!  The end result is that Microsoft charges the company $300 for a single bad query from a user.  It seems unfair, especially since that user may never even realize they did anything wrong, and may repeat their mistake again and again.  Power BI is a SaaS platform that sells services directly to end users.  For $300 per query, it seems like someone at Microsoft should at least give that user a personal phone call and thank them for their business!  (And explain how to avoid the surcharge next time.)

 

Does anyone else struggle with these excessive surcharges when autoscale is enabled?  I have looked for features to help mitigate this.  E.g. there is a new "surge protection" feature, but it only helps with background usage; in my example the problem was triggered by a foreground query, and it was purely accidental.  I think the only good options are to turn off autoscale (and suffer large throttling outages), or to migrate from P1 to F64, which would let us scale the capacity on demand, manually.  Another option might be to push these developers out of the capacity and ask them to do their development work in "shared/pro" workspaces.  (But that doesn't always seem feasible, given the lack of feature parity between shared and premium capacity.)

 

Am I overlooking any other solutions?  Does it sound reasonable for our low-code users to be charged an extra $300 for making a mistake in their PQ and DAX?

 

 

1 ACCEPTED SOLUTION

Found out that there is an enhancement underway to give users a way to gather "CU" events via the Real-Time Hub.  This would allow Fabric F64 customers to implement their own scaling operations as needed, and would avoid the 24-hour autoscale.
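In the meantime, here is a minimal sketch of what such an on-demand scale operation might look like for an Azure-hosted F SKU, assuming the Microsoft.Fabric/capacities ARM resource type.  The subscription, resource group, capacity name, and api-version below are placeholders to verify against your own environment:

```python
# Hypothetical sketch: resize an Azure-hosted Fabric capacity on demand via ARM,
# instead of living with the 24-hour Premium autoscale window.
# The subscription, resource group, capacity name, and api-version are placeholders.
import requests
from azure.identity import DefaultAzureCredential

SUBSCRIPTION_ID = "<subscription-id>"
RESOURCE_GROUP = "<resource-group>"
CAPACITY_NAME = "<capacity-name>"
API_VERSION = "2023-11-01"  # verify the current api-version for Microsoft.Fabric/capacities

def scale_capacity(sku_name: str) -> None:
    """PATCH the capacity to a new SKU, e.g. 'F128' during a spike and back to 'F64' later."""
    token = DefaultAzureCredential().get_token("https://management.azure.com/.default").token
    url = (
        f"https://management.azure.com/subscriptions/{SUBSCRIPTION_ID}"
        f"/resourceGroups/{RESOURCE_GROUP}/providers/Microsoft.Fabric"
        f"/capacities/{CAPACITY_NAME}?api-version={API_VERSION}"
    )
    resp = requests.patch(
        url,
        headers={"Authorization": f"Bearer {token}"},
        json={"sku": {"name": sku_name, "tier": "Fabric"}},
        timeout=30,
    )
    resp.raise_for_status()

# Example: scale up while CU telemetry shows sustained overload, then drop back
# once the carry-forward has burned down.
# scale_capacity("F128")
# scale_capacity("F64")
```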


8 REPLIES
v-karpurapud
Community Support

Hi @dbeavon3 

As we haven't heard back from you, we wanted to kindly follow up and check whether the solutions provided by our super users @lbendlin and @collinq helped you. Please let us know if you need any further assistance here.

 

 

Your feedback is important to us. Looking forward to your response.

Thanks

lbendlin
Super User

We do not enable autoscale on any of our P capacities; we would rather take the interactive delay.  Query runtime is limited to 180 seconds.  Dataflows are canceled after 5 hours.  We also run a real-time report (with the usual 8-minute delay) to check the pulse on all the capacities, so we can pre-announce when one of them is about to lock up.  Every 15 minutes we move "innocent bystander" workspaces to other capacities if they have space.  We have daily checks for the main offenders - they get a nicely formatted email detailing how their queries and refreshes impact the capacity they are on.  We offer them training sessions to reduce their CU impact.  We have a dedicated "dog house" capacity where we move anyone who is not receptive to our gentle guidance.
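For anyone who wants to script that workspace-move step, here is a rough sketch against the Power BI REST API's AssignToCapacity endpoint.  The IDs and the token acquisition are placeholders, and the caller needs the appropriate admin permissions on both the workspace and the target capacity:

```python
# Rough sketch of the "move innocent-bystander workspaces" step, using the Power BI
# REST API's AssignToCapacity endpoint. IDs and token acquisition are placeholders.
import requests

def move_workspace(workspace_id: str, target_capacity_id: str, access_token: str) -> None:
    """Reassign a workspace (group) to a different Premium/Fabric capacity."""
    url = f"https://api.powerbi.com/v1.0/myorg/groups/{workspace_id}/AssignToCapacity"
    resp = requests.post(
        url,
        headers={"Authorization": f"Bearer {access_token}"},
        json={"capacityId": target_capacity_id},
        timeout=30,
    )
    resp.raise_for_status()

# Example (IDs are hypothetical):
# move_workspace("11111111-aaaa-....", "22222222-bbbb-....", token_from_your_auth_flow)
```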

 

Early warning:  Notebooks in Fabric are capable of incurring exorbitant costs if you are not vigilant.

I wasn't aware that we could limit query runtime on semantic model queries.  I've never had Power BI administrator rights.  Can this be further restricted by the developers themselves, if they want to be even more cautious?

How to configure workloads in Power BI Premium - Power BI | Microsoft Learn


 

Thanks for the warning about notebooks.  I suspected that to be the case.  Users are deploying some new PySpark notebooks to production in the next month, and it will be a learning experience for everyone.  Most of my Spark experience is on other platforms.

 

(At first glance, I think the biggest problem for PySpark notebooks in Fabric is the way the billing/accounting is performed.  Spark was never supposed to be billed per-job-time or per-notebook-time.  A normal Spark infrastructure allows you to pay a fixed cost ... like $3 to $10 an hour for ~5 to 20 large nodes. ...  Normally we can expect to run an ENORMOUS number of concurrent notebooks/executors in the Spark cluster for just $10 an hour, and scale back to $0 per hour when idle....  A customer expects a large ROI on the fixed cost of these nodes when running an increasing number of notebooks.

... But if the meter in Fabric is based on notebook execution time, then our costs will NOT stay fixed.  They will just keep increasing in proportion to the number of notebooks that are submitted.  It is Microsoft that stands to benefit the most - the number of notebook executions keeps increasing, while the underlying infrastructure has leveled out.)
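To put rough numbers on that contrast, here is a toy cost model.  The rates are invented for illustration only (they are not Fabric's, or anyone's, actual pricing); the point is just that one curve stays flat while the other grows with notebook volume:

```python
# Toy cost model for the point above: a fixed-price cluster flattens out as notebook
# volume grows, while per-notebook metering grows linearly.  Rates are invented for
# illustration only; they are NOT actual Fabric pricing.
def fixed_cluster_cost(cluster_hours: float, rate_per_hour: float = 10.0) -> float:
    # Pay for the cluster's wall-clock hours, no matter how many notebooks share it.
    return cluster_hours * rate_per_hour

def per_notebook_cost(notebooks: int, avg_minutes: float, rate_per_minute: float = 0.05) -> float:
    # Pay for every notebook-minute executed, so cost scales with submissions.
    return notebooks * avg_minutes * rate_per_minute

for n in (10, 100, 1000):
    print(
        f"{n:>5} notebooks/day: "
        f"fixed cluster ~${fixed_cluster_cost(8):.0f}/day vs "
        f"per-notebook metering ~${per_notebook_cost(n, avg_minutes=15):.0f}/day"
    )
```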

 

collinq
Super User

Hi @dbeavon3 ,

 

I think that you are giving some of the best solutions - turn off the autoscale or move developers that might not understand what is happening out of the capacity.  The other option that I can think of is to provide better training. But, even an experienced developer can accidentally trigger a query that causes the extra charge to occur.

 

It might work to have a "final approval" developer - one who is allowed to run queries in the autoscale capacity and confirms that everything is good to go prior to the query being run.  

 

I understand how you might feel that this is unfair, but how is Microsoft to know or understand when a query is being run incorrectly?  I do not think that there is any way for them to know that an individual user has done something incorrect in a query.

 

That does trigger a thought though - maybe you can write some code, a Power Automate flow, or just a message or business rule(s) that requires a developer to go through a checklist prior to starting a query. You could put the standard issues you have seen in your checklist - like "Have you turned on (or off) the appropriate slicers?" or "Did you confirm that the query is hitting the right database?"








Hi @collinq 
Thanks for the reply.

 

IMO, the product makes a large amount of money when people make mistakes.  As an old-school programmer, I have always relied on sandboxes in pre-prod environments.  It should cost little or nothing to make mistakes.  A good programmer will make lots of mistakes... but if those mistakes only happen in our developer sandbox, then we still count ourselves successful.

 

But in Fabric the meter is always running, and Microsoft takes a cut out of every mistake!  There isn't really any sandbox where the penalty goes away.  The only thing that comes close to a sandbox in Fabric is the so-called "pro" workspace.  But it has such a large feature gap that the "self-service" users who have pro licenses will virtually never use it.  (I have almost stopped making this recommendation to these folks.)

 

I'm glad I'm not overlooking any obvious solutions.  I suppose this problem is happening to every company to some degree, but nobody is eager to talk about their mistakes, so they happily pay the $300 mistake-tax (even if it happens weekly). 

 

>> write some code or a power automate or just a message or a business rule(s) that requires a developer ...

 

IMO, this deflects responsibility from the product itself.  It also defeats the goals of "self-service" business intelligence.  I.e. supervising at this level is a lot of work (every PQ and DAX query).  If these low-code users have to be supervised to this degree by a high-code developer, then it seems like we are taking steps backward to the original starting point.

 

I think the product itself needs better feedback loops, where mistakes are isolated and corrected more quickly.  These "self-service" users need sufficient motivation to fix their bugs and save some money.  It may not be deliberate, yet they are likely to repeat the same mistakes again and again if there is nothing to slow them down.  

 

I may review this topic again after we move to F64.  On that SKU, I think the autoscale is implemented very differently than on a P1.  There should be a middle ground where you let users suffer the consequences for a period of time before the capacity is manually scaled up and the "carry forward" is flushed out.

 

 

 

 

Hey @dbeavon3 ,

 

Maybe you can look for or create an idea for a "sandbox" or "test" environment that allows a set number of queries or attempts before the charge gets hit?  Or maybe one based on "x" number of hours per month or some other criteria. I don't know, but that might be a worthwhile workaround from Microsoft's end to help out with this.

 

A second idea you have already touched on would be to give that feedback loop - with all of the AI and other features, I wonder how difficult it would be to use the Performance Analyzer or other features to do a "pre-execute" check that gives a quick "did you know that this will run for a long time?" kind of message.  I am not sure how easy that is to do BEFORE a query runs...

 

You are right - there are lots of companies that are eating these costs (and frankly, some may not even realize it).

 

As a fellow old-schooler, I think that this is part of the catch of the world that we are in now - that "low-code" folks sometimes trigger great costs because they may not understand the consequences. They may not even understand that there is an issue or an error or something missing.









>> features to do a "pre-execute" that gives a quick "did you know that this will run for a long time"

 

Right, some of the problem could be solved with low-tech options.  E.g. they could simply introduce some additional timeouts (a mandatory timeout per semantic model query, and one per PQ mashup query).  If query plans are executing logic in parallel, then the timeout should be specified as a maximum number of vcore-seconds.

 

This sort of approach is decades old, and it is relatively simple to implement and understand.  Even a PBI developer with one year of experience would probably know how long a single query should run, and how long the outer PQ mashup should run in a loop (over a series of dataset queries).  If developers were able to specify timeouts to protect themselves, it would totally avoid the $300 mistake-tax.
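As a purely illustrative sketch of what such a developer-specified budget could look like today, here is a client-side wrapper around the public executeQueries REST endpoint.  The dataset ID, token acquisition, and assumed vcore count are placeholders, and a client-side timeout only stops the caller from waiting - it does not guarantee the service cancels the query, which is exactly why a server-enforced limit would be the real fix:

```python
# Purely illustrative: a developer-side budget expressed in vcore-seconds, converted to a
# wall-clock limit for a single DAX call through the executeQueries REST endpoint.
# The dataset ID, token, and assumed vcore count are placeholders.
import requests

def run_dax_with_budget(dataset_id: str, dax: str, access_token: str,
                        vcore_seconds_budget: float, assumed_parallel_vcores: int = 8) -> dict:
    """Run a DAX query under a rough vcore-seconds budget."""
    wall_clock_limit = vcore_seconds_budget / assumed_parallel_vcores
    url = f"https://api.powerbi.com/v1.0/myorg/datasets/{dataset_id}/executeQueries"
    resp = requests.post(
        url,
        headers={"Authorization": f"Bearer {access_token}"},
        json={"queries": [{"query": dax}]},
        timeout=wall_clock_limit,  # abandon the wait if no response arrives within the limit
    )
    resp.raise_for_status()
    return resp.json()

# Example: allow at most 240 vcore-seconds (~30 s of wall clock at 8 parallel vcores).
# run_dax_with_budget("<dataset-id>", "EVALUATE TOPN(100, Sales)", token, 240)
```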

 

But something as simple as timeouts might be asking too much...

 

Further, I have observed that a surprising number of data-engineering tools are NOT built to run on private infrastructure (local servers or workstations).  I don't want to sound overly cynical, but I have to believe there is a reason why this is so.  Not only do the vendors want to host your production workloads, they want to host your development activities as well.  Would it be possible for Microsoft to create an on-prem version of their ADF pipelines?  Of course!  Could Snowflake allow you to run an x-small warehouse on a local VM image?  Of course!  ... However, it is never in their interest to pursue these options.  I'd guess that these vendors are making at least 20 or 30% of their profits off developers in a dev environment who are still trying to learn the product, or experimenting with various solutions (or making unintentional mistakes).

 

... As such, the vendors lack the incentive to implement such simple features as pre-defined timeouts - because it would lead directly to a large drop in their profits.  Even if we upvoted these types of suggestions in the ideas portal and the vote count reached 1,000 or 100,000, they would still not be well received.  They just aren't aligned with the vendor's strategic goals.  This is true of lots of vendors that create data-engineering tools.  That said, I'm happy to know that much of the stuff running in "Fabric" is open source, and customers have a number of release valves that can be explored if things get too pricey!  The lock-in isn't so bad, assuming you have some knowledge about what is happening under the covers.

 

 

 
