Hi everyone,
I am trying to use the custom prompt AI function with Fabric's default LLM settings. I need to send over 100k records through the LLM, but this results in immediate capacity issues because of the volume of records being sent. Has anyone else run into this, and if so, how did you work around it? Would using my own key and endpoint set up in Azure AI Foundry avoid this, or would that also create capacity issues? More details below:
Environment
Workspace SKU: F128 (autoscale for Spark ON)
Feature: Custom Prompt AI (Fabric’s default LLM endpoint)
Use case: Classifying ~100K customer-address records in one pass
Data size: ≈1M tokens per run (~10 tokens per record × 100K records)
Problem
Whenever I call the endpoint with the full batch, capacity utilization spikes > 300 % and Fabric immediately throttles the workspace. Even if I chunk the dataframe into 5–10 calls, the bursts still blow the CU budget and the overage burndown takes hours.
What I’ve tried
| Approach | Result |
| --- | --- |
| `ai.generate_response()` on the full Spark DF | Instant overage / throttling |
| Micro-batching (10K rows) | 8–10 bursts, still >200% CU |
| Switching to a lower-temperature model | Token cost drops, CU spike unchanged |
| Autoscale for Spark pools | Helps after the fact (burndown), not at call time |
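For concreteness, here is a stripped-down sketch of the micro-batching I'm running today (pandas in a Fabric notebook; the batch size, pause length, and prompt text are placeholders I keep tuning, and the ai.generate_response call is written the way I currently use it):

```python
import time
import pandas as pd

BATCH_SIZE = 10_000   # rows per call (placeholder; I've tried 5K-20K)
PAUSE_SECONDS = 60    # crude gap between bursts; does not stop the CU spikes

def classify_addresses(df: pd.DataFrame) -> pd.DataFrame:
    """Run the custom prompt AI function over the frame in chunks."""
    results = []
    for start in range(0, len(df), BATCH_SIZE):
        chunk = df.iloc[start:start + BATCH_SIZE].copy()
        # Each call below is one burst against the default Fabric LLM endpoint.
        chunk["category"] = chunk.ai.generate_response(
            "Classify this customer address into a region bucket."
        )
        results.append(chunk)
        time.sleep(PAUSE_SECONDS)
    return pd.concat(results, ignore_index=True)
```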
Questions for the community
1. Batch / stream patterns – Has anyone found a sweet-spot batch size or sliding-window pattern that keeps CU under 100% while feeding large dataframes to the LLM?
2. Async vs. sync calls – Can I fire off smaller async requests and aggregate the responses later without paying the “concurrent CU penalty” in one big spike? (Rough sketch of what I mean at the end of this post.)
3. Queue or orchestrator – Does Fabric offer a built-in queue for LLM calls (similar to the Synapse Spark job queue), or is everyone rolling their own (e.g., Delta table + Logic Apps / Data Factory orchestrator)?
4. Model / capacity separation – Is it possible to point Custom Prompt AI at a pay-as-you-go Azure OpenAI resource so the heavy LLM work lands outside the Fabric capacity? (Also sketched at the end of this post.)
5. Throttle-aware retry logic – Any sample code that backs off when Peak % > X and resumes when burndown < Y? (The kind of loop I have in mind is right after this list.)
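For question 5, this is roughly the shape of the backoff loop I have in mind. I'm not aware of a public API that exposes the live Peak %, so this placeholder just backs off on throttling errors (HTTP 429 / capacity rejections) with exponential delay and jitter; submit_batch stands in for whatever actually issues the LLM call:

```python
import random
import time

def call_with_backoff(submit_batch, batch, max_retries=8, base_delay=30):
    """Retry a batch of LLM calls, backing off exponentially when throttled."""
    for attempt in range(max_retries):
        try:
            return submit_batch(batch)   # whatever issues the LLM call for this batch
        except Exception as err:         # ideally narrowed to the throttling error type
            if attempt == max_retries - 1:
                raise
            # Exponential backoff with jitter so retries don't re-spike together.
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            print(f"Throttled ({err}); retrying in {delay:.0f}s")
            time.sleep(delay)
```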
I’d love to hear how others are handling high-volume LLM inference in Fabric without upgrading to an even larger SKU or waiting hours for overages to clear.
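And for questions 2 and 4 combined, this is the kind of pattern I'm imagining if the heavy work can be pointed at my own pay-as-you-go Azure OpenAI deployment: small async requests capped by a semaphore so only a few are in flight at once. The endpoint, key, deployment name, and API version below are placeholders, and I haven't confirmed whether this actually keeps the load off the Fabric capacity:

```python
import asyncio
from openai import AsyncAzureOpenAI

# Placeholder credentials -- in practice these would come from Key Vault.
client = AsyncAzureOpenAI(
    azure_endpoint="https://<my-resource>.openai.azure.com",
    api_key="<api-key>",
    api_version="2024-06-01",
)

MAX_IN_FLIGHT = asyncio.Semaphore(5)  # cap concurrent requests to avoid one big spike

async def classify(record: str) -> str:
    """Classify a single address record against my own Azure OpenAI deployment."""
    async with MAX_IN_FLIGHT:
        resp = await client.chat.completions.create(
            model="<deployment-name>",
            messages=[
                {"role": "system", "content": "Classify the customer address into a region bucket."},
                {"role": "user", "content": record},
            ],
        )
        return resp.choices[0].message.content

async def classify_all(records: list[str]) -> list[str]:
    return await asyncio.gather(*(classify(r) for r in records))

# In a notebook (which already has an event loop): results = await classify_all(address_records)
```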
We are following up once again regarding your query. Could you please confirm if the issue has been resolved through the support ticket with Microsoft?
If the issue has been resolved, please share the resolution or key insights here to help others in the community. If we don’t hear back, we’ll go ahead and close this thread.
Should you need further assistance in the future, we encourage you to reach out via the Microsoft Fabric Community Forum and create a new thread. We’ll be happy to help.
Thank you for your understanding and participation.
Hi @AnthonySottile,
If your issue still persists, please consider raising a support ticket for further assistance.
To raise a support ticket for Fabric and Power BI, kindly follow the steps outlined in the following guide:
How to create a Fabric and Power BI Support ticket - Power BI | Microsoft Learn
Thanks,
Prashanth Are
MS Fabric community support
Hi @AnthonySottile,
As we haven’t heard back from you, we wanted to follow up and check whether there has been any progress on the above-mentioned issue. Let me know if you still need any further help here.
Thanks,
Prashanth Are
MS Fabric community support
Hi @AnthonySottile,
We would like to follow up to see if the solution provided by the super user resolved your issue. Please let us know if you need any further assistance.
@lbendlin, thanks for your prompt response.
If our super user's response resolved your issue, please mark it as "Accept as solution" and click "Yes" if you found it helpful.
Thanks,
Prashanth Are
MS Fabric community support
The only thing I can advise on is number 5: look into capacity Surge Protection. Note that it is based on the Background Rejection % (the very last tab), and you need to gauge a value that works for you. For example, we use 37% for on and 35% for off, but your values will most likely be different.