EMark
Frequent Visitor

Dataflow Gen2 load/refresh awfully slow

First of all, apologies if this is not the right forum for this topic (I couldn't find a more appropriate one).

I have a PBI report published in the Power BI Service, with a connection to SQL Server and two queries inside the report. The report is not big: one of the tables has 7 columns and 100 rows, the other has 9 columns and 75K rows, about 700 KB in size, i.e. nothing big. In fact, this is the smallest report of our whole BI suite. This report takes between 20 and 30 seconds to refresh daily, sometimes even 15. So far so good.

 

 

[Screenshot: EMark_0-1715009303966.png]

 

Now for the issue that puzzles me.

 

We use Microsoft Fabric and have a roadmap for fully automating various processes. Before switching to Fabric we were using the regular Power BI service, but we chose Fabric for the many good ideas and potential options the product offers, or will offer. Not surprisingly, we have discovered that it has plenty of bugs and glitches, which we would expect to be addressed and fixed within a reasonable time. But some of these problems concern very basic functions or basic performance, and frankly they are disappointing.

 

Regarding the report, what we did was move the two queries out of the report and into a stored procedure.

 

This is the flow (all of this happens in a Fabric pipeline):

 

  1. We pull the required data from the cloud-hosted SQL Server into staging tables via two Dataflows Gen2.
  2. A stored procedure runs right after the two staging tables are updated; it takes the data from those tables and inserts it into the reporting tables in the data warehouse.
  3. After the SP finishes, the last step in the pipeline triggers the PBI refresh. The PBI report (dashboard) reads directly from the two tables (select * from ... #XYZ).
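
As a side note on step 3: the semantic model refresh can also be triggered programmatically via the Power BI REST API instead of a pipeline activity. Below is a minimal sketch, assuming you already have an Azure AD access token with the right scopes; the workspace and dataset IDs are placeholders:

```python
import requests

# Placeholders -- substitute your own workspace (group) and dataset IDs.
WORKSPACE_ID = "<workspace-guid>"
DATASET_ID = "<dataset-guid>"
ACCESS_TOKEN = "<aad-access-token>"  # e.g. acquired via MSAL or a service principal

# Power BI REST API: queue an asynchronous refresh of the dataset (semantic model).
url = (
    f"https://api.powerbi.com/v1.0/myorg/groups/{WORKSPACE_ID}"
    f"/datasets/{DATASET_ID}/refreshes"
)
resp = requests.post(url, headers={"Authorization": f"Bearer {ACCESS_TOKEN}"})
resp.raise_for_status()  # a 202 Accepted response means the refresh was queued
print("Refresh queued:", resp.status_code)
```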

 

All 3 steps above take between 5 and 6 minutes, which is ridiculous. We use the same SQL Server, the same tables, etc. in both cases; yet we have this huge performance difference: when the query runs from inside the report it takes 20-30 seconds, but when we move the query to a separate process outside the report it takes 6 minutes, using the same premises and the same resources!?

 

[Screenshot: EMark_1-1715011050051.png]

 

Any ideas, suggestions, explanations, etc. are more than welcome.

 

Thanks,

EM

2 REPLIES
johnbasha33
Solution Sage

@EMark 

It's indeed puzzling that you're experiencing such a significant difference in performance between running the queries inside Power BI and running them through your stored procedure. Here are some potential reasons and suggestions to explore:

1. **Data Transfer Overhead**: When running the queries directly inside Power BI, the data transfer between the SQL Server and Power BI Service is optimized and potentially more efficient compared to running the stored procedure separately. Check if there are any bottlenecks or limitations in the data transfer process when using the stored procedure.

2. **Stored Procedure Optimization**: Review the stored procedure code to ensure it's optimized for performance. Are there any inefficient queries, unnecessary joins, or resource-intensive operations that could be optimized? Consider using SQL Server Profiler to analyze the execution plan and identify any areas for improvement.

3. **Concurrency and Resource Allocation**: Check if there are any concurrency or resource allocation issues affecting the performance of the stored procedure. Are there other processes or queries running simultaneously that could be impacting performance? Consider adjusting resource allocation or scheduling to minimize contention.

4. **Data Transformation and Aggregation**: Are there any additional data transformation or aggregation steps performed in the stored procedure that could be contributing to the longer runtime? Evaluate if these steps are necessary and if they can be optimized or moved to a different stage in the pipeline.

5. **Network Latency and Bandwidth**: Consider if network latency or bandwidth limitations could be affecting the performance of the data transfer process. Are there any network bottlenecks or constraints that could be impacting the speed of data transmission between the SQL Server and your data warehouse?

6. **Power BI Service Configuration**: Review the configuration and settings of your Power BI Service environment. Are there any limitations or constraints in terms of resource allocation, concurrency, or data transfer bandwidth that could be impacting performance?

7. **Monitoring and Logging**: Implement monitoring and logging mechanisms to track the performance of the data transfer process and identify any specific areas of concern or slowdowns. This could help pinpoint the root cause of the performance difference and guide optimization efforts (see the sketch below).

By investigating these potential factors and exploring optimization opportunities, you may be able to improve the performance of the data transfer process when using the stored procedure. Additionally, consider reaching out to Microsoft support or community forums for further assistance and insights specific to your environment and setup.
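
To illustrate point 7: a minimal sketch that polls the dataset's refresh history via the Power BI REST API, so the actual refresh durations of the two setups can be compared. It assumes the same kind of access token as any REST call against the service; all IDs are placeholders:

```python
import requests

WORKSPACE_ID = "<workspace-guid>"    # placeholder
DATASET_ID = "<dataset-guid>"        # placeholder
ACCESS_TOKEN = "<aad-access-token>"  # placeholder

# Power BI REST API: list the most recent refreshes with their timings.
url = (
    f"https://api.powerbi.com/v1.0/myorg/groups/{WORKSPACE_ID}"
    f"/datasets/{DATASET_ID}/refreshes?$top=10"
)
resp = requests.get(url, headers={"Authorization": f"Bearer {ACCESS_TOKEN}"})
resp.raise_for_status()

for refresh in resp.json()["value"]:
    # startTime/endTime are ISO-8601 timestamps; status is e.g. "Completed" or "Failed".
    print(refresh["status"], refresh.get("startTime"), refresh.get("endTime"))
```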


Did I answer your question? Mark my post as a solution! Appreciate your Kudos !!

Hi johnbasha,

 

Thanks for your input, I appreciate it.

 

Here are my comments:

  1. “... Check if there are any bottlenecks or limitations in the data transfer process when using the stored procedure.”
  • The SP is pretty simple: it just takes the data from the staging table and puts it into the reporting (production) table. The main code itself, which sits in the Dataflow process, is also simple, with basic joins.

 

  2. “**Stored Procedure Optimization**: Review the stored procedure code to ensure it's optimized for performance. Are there any inefficient queries, unnecessary joins…”
  • As I mentioned in my post, the report is very small: just a few columns and not many rows. The code is simple, without complicated calculations, joins, etc. In SSMS this code runs in about 10 seconds; as embedded code in the PBI report it runs in 20-30 seconds in the Service; and on my laptop it takes no more than a minute, despite going through a VPN and pulling the data from the server to a local machine.

 

  3. “**Concurrency and Resource Allocation**: Check if there are any concurrency or resource allocation issues affecting the performance of the stored procedure. Are there other processes or queries running simultaneously…”
  • No, there aren't other processes running; it is just this one. I'm still running tests to evaluate the performance of Fabric, to explore and confirm what hurdles and potential issues we would face if we migrated all our processes to Fabric. That's why I'm testing with the smallest and fastest report we have. Considering what I've seen so far, I can't imagine testing with a big one, which has about 1.8K lines of code and returns a result of 96 columns and 4M rows.

 

I still cannot find a logical explanation for why the report updates on my local machine (which has aggravating factors like downloading the data locally through a VPN) in about a minute, while in the MS Fabric environment using Dataflow Gen2 it takes 6 minutes altogether, even though all the data transfers happen in the cloud (our SQL Server is cloud-based), which is supposed to give the fastest performance.

 

We have a version of the report published in the PBI service within the same Fabric account with the SQL code embedded in it, and that report refreshes daily in about 20-30 seconds. The other version (the problematic test one) pulls the data using Dataflow Gen2 (same code, same environment, same Fabric account) and takes ages to update.

 

Everything we read about Dataflow Gen2 says it is fast, seamless, reliable, etc., and that it uses the same engine as Power BI. If it uses the same engine, why do we see such a performance gap?

 

“… Dataflow Gen2 is built using the familiar Power Query experience that’s available today across several Microsoft products and services such as Excel, Power BI, Power Platform, Dynamics 365 Insights applications, and more…”

 

“..Dataflow Gen2 in Fabric is the evolution of Dataflow, redesigned from the ground up with more powerful compute engines, data movement capabilities to help you load data to several Fabric and Azure data destinations…”

 

“… Dataflows Gen2 are authored using Power Query. Once you publish a dataflow, the dataflow definition is generated – this is the program ….”

 

 

I'm reading that many users, in different topics on this forum and across other forums, have similar complaints, and from what I can see one of the suggested solutions is to use Python (PySpark) for the data load, which is apparently much faster than using a Dataflow.
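
For context, here is a rough sketch of what such a PySpark load could look like in a Fabric notebook: read the source table over JDBC and write it to a lakehouse Delta table. The server, database, table names, and credentials below are placeholders, and the exact connector options may differ per environment:

```python
# Runs in a Fabric notebook, where `spark` is the preconfigured SparkSession.
# All connection values below are placeholders -- substitute your own.
jdbc_url = (
    "jdbc:sqlserver://<your-server>.database.windows.net:1433;"
    "database=<your-database>;encrypt=true"
)

# Read the source table from the cloud-hosted SQL Server over JDBC.
df = (
    spark.read.format("jdbc")
    .option("url", jdbc_url)
    .option("dbtable", "dbo.SourceTable")  # placeholder table name
    .option("user", "<sql-user>")
    .option("password", "<sql-password>")
    .load()
)

# Overwrite the corresponding staging table in the lakehouse as a Delta table.
df.write.mode("overwrite").format("delta").saveAsTable("stg_SourceTable")
```

Whether this would replace just the Dataflow step or the whole staging flow depends on where the reporting tables live.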

 

PS:

It seems that the "staging" property of Dataflow Gen2 worsens performance, based on some articles I just read. It turns out that "staging" is enabled by default, but switching it off doesn't allow you to put the result of the dataflow into the warehouse... simply fabulous.

 

[Screenshot: EMark_1-1715084400638.png]

 

 

[Screenshot: EMark_0-1715084370746.png]

 
