Jeanxyz
Impactful Individual

What's the best way to load data using Power Query?

I have created a Power BI report that loads about 700k rows of stock transaction data from a CSV file stored on my local computer. There are only 7 columns in the CSV, and I have loaded the data in fixed decimal and date formats to minimize the table size, but it still takes 15 minutes to refresh the report in the Power BI service.

I have run Best Practice Analyzer and VertiPaq Analyzer; they show the decompressed model is only 45 MB, which is not very big. I have also removed several calculated columns and expanded the cache size, but the refresh time remains the same.

I am wondering if there is a way to reduce the data refresh time. I create the CSV file from an API using Python code. Would data loading get faster if I exported the data to JSON or another format?

 

* I assume the optimal solution would be to store the data in a Fabric lakehouse with incremental load, but this option is not available to me as I don't have a Microsoft 365 or OneDrive account.

Anonymous
Not applicable

Hi @Jeanxyz,

I wanted to follow up since we haven't heard back from you regarding our last response. We hope your issue has been resolved.
If a community member's reply answered your query, please mark it as "Accept as Solution" and select "Yes" if it was helpful.
If you need any further assistance, feel free to reach out.

Please continue using Microsoft community forum.

Thank you,
Pavan.

Anonymous
Not applicable

Hi @Jeanxyz,

I hope this information is helpful. Please let me know if you have any further questions or if you'd like to discuss this further. If this answers your question, kindly "Accept as Solution" and give it a 'Kudos' so others can find it easily.

Thank you,
Pavan.

Anonymous
Not applicable

Hi @Jeanxyz,

I wanted to check if you had the opportunity to review the information provided. Please feel free to contact us if you have any further questions. If my response has addressed your query, please "Accept as Solution" and give a 'Kudos' so other members can easily find it.

Thank you,
Pavan.

ZhangKun
Super User

If reading a 700,000-row local .csv file takes 15 minutes, you should check your queries (assuming you are only loading the data into the model).

Based on past experience, the following operations will seriously slow down an M query:

1. Sorting (not very time-consuming unless there are many sort steps).

2. Merging queries and aggregating (in some cases the file is read once per group; for example, with 100 groups the file is read 100 times). A pandas sketch of pre-aggregating during export follows this list.

3. Frequent indexing (or looking up) of values in the table (for example, an added column that sums all rows of the original table greater than the current row; with 100 rows, the file is read 100 times).
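
Where such a merge or group-level aggregate is genuinely needed, one way to avoid the repeated file reads is to do that work once in the Python export step rather than in Power Query. The following is only a hedged sketch: the file name, column names, and the aggregate itself are hypothetical and not taken from the original report.

# Hedged sketch: pre-compute a per-stock aggregate during export so that
# Power Query only has to load the finished result. All names are placeholders.
import pandas as pd

df = pd.read_csv("stock_transactions.csv", parse_dates=["Date"])

# One pass over the data here instead of one file read per group in Power Query.
per_stock = df.groupby("Stock", as_index=False).agg(avg_close=("Close", "mean"))

# Merge the aggregate back and export a single enriched file.
df.merge(per_stock, on="Stock", how="left").to_csv(
    "stock_transactions_enriched.csv", index=False
)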

Jeanxyz
Impactful Individual

Sorry, I had surgery recently and was not able to reply in time.

I have thought about the whole picture. The slow loading is not related to merging or sorting. My Power Query is very simple: just loading the file and changing types.

A Parquet file doesn't solve the problem because each time I export the data it is downloaded as a new Parquet file; there are no tracking logs since each download runs in a new Colab session.

I tend to think the slow import is because I'm reading a file from my local PC into Power BI, which seems to be very slow.

If anyone wants to test, I have uploaded the .pbix file and .csv file to OneDrive. Here is the URL: https://drive.google.com/drive/folders/144Hx6EiMGbJWwuE3rd7604s_GP-eqOZb?usp=sharing

@ZhangKun 

I changed the data source of the fact_stocks table in the pbix file to the .csv file you provided, and it took 5-10 seconds to refresh only fact_stocks.

I checked your Power Query. In the "Nasdque Symbol" query, you deduplicated the table. If the data source is a very large table, this consumes a lot of resources. Perhaps you can reduce the number of rows in the data source to determine whether the deduplication is the problem.

 

Hope you get well soon.

Jeanxyz
Impactful Individual

Did you remove the stock filter in Power Query? I have to add a filter; otherwise it takes 30 minutes to load 70k rows of data from the CSV.

I conducted a series of tests and found that the problem is that the calculated column's calculation is too complicated. You can try using the following formula to replace the calculated column in the fact_stocks table. It takes about 30 seconds.

2-day range =
-- Pre-filter once to the current row's stock, then iterate only over that subset
VAR tFilterStock = FILTER ( 'fact_stocks', fact_stocks[Stock] = EARLIER ( fact_stocks[Stock] ) )
-- Previous date for this stock
VAR c_1d = MAXX ( FILTER ( tFilterStock, fact_stocks[Date] < EARLIER ( fact_stocks[Date] ) ), fact_stocks[Date] )
-- High and low over the previous and current dates
VAR max_ = MAXX ( FILTER ( tFilterStock, fact_stocks[Date] <= EARLIER ( fact_stocks[Date] ) && fact_stocks[Date] >= c_1d ), fact_stocks[High] )
VAR low_ = MINX ( FILTER ( tFilterStock, fact_stocks[Date] <= EARLIER ( fact_stocks[Date] ) && fact_stocks[Date] >= c_1d ), fact_stocks[Low] )
-- Current day's close
VAR close_ = MAXX ( FILTER ( tFilterStock, fact_stocks[Date] = EARLIER ( fact_stocks[Date] ) ), fact_stocks[Close] )
RETURN
    DIVIDE ( max_ - low_, close_ )

 

I just filter to the corresponding [Stock] first and apply the remaining filters afterwards. This greatly reduces the number of iterations.

I did not check the results; you should double-check that they are correct.

 

Jeanxyz
Impactful Individual

Thanks a lot for sharing, @ZhangKun. The idea of creating the tFilterStock table is an eye-opener for me. I guess in my previous formula, VertiPaq had to create multiple physical tables to complete the calculation, while tFilterStock creates a temporary table in memory, which probably avoids the need for multiple physical tables.

For your information, I used the OFFSET function, which works more or less the same way.

2-day range =
VAR prev_low =
    CALCULATE (
        MAX ( fact_stocks[Low] ),
        OFFSET ( -1, ORDERBY ( fact_stocks[Date], ASC ), PARTITIONBY ( fact_stocks[Stock] ) ),
        ALL ( fact_stocks )
    )
VAR prev_low_adj =
    MIN ( IF ( ISBLANK ( prev_low ), fact_stocks[Low], prev_low ), fact_stocks[Low] )
VAR prev_high =
    CALCULATE (
        MAX ( fact_stocks[High] ),
        OFFSET ( -1, ORDERBY ( fact_stocks[Date], ASC ), PARTITIONBY ( fact_stocks[Stock] ) ),
        ALL ( fact_stocks )
    )
VAR prev_high_adj =
    MAX ( IF ( ISBLANK ( prev_high ), fact_stocks[High], prev_high ), fact_stocks[High] )
RETURN
    DIVIDE ( prev_high_adj - prev_low_adj, fact_stocks[Close] )

I have, however, one more question:

How do you figure out which calculated column caused the refresh issue? I tried DAX Studio, the performance optimizer tool, and Power Query diagnostics, but all of them failed because they couldn't capture the query plan during the dataset refresh cycle. So how can I quickly find out which calculated column caused the slow refresh?

Calculated columns are recalculated when you refresh the model, which happens after Power Query has finished loading the data. If you want to diagnose the problem, you can use SQL Server Profiler.

 

[screenshot: ZhangKun_0-1747924705635.png]

I understand what you mean. I will retest tonight; the company's network cannot access Google.

Anonymous
Not applicable

Hi @Jeanxyz,

Thank you for reaching out in Microsoft Community Forum.

Thank you @lbendlin for the helpful response.

Please follow the steps below to improve Power BI refresh time for local CSV files:

1. CSV is not optimized for large data loads. Try converting your file to Parquet or Feather using your Python script; these columnar formats load into Power BI significantly faster than CSV or JSON (a Python sketch follows this list).

2. Make sure transformations like type changes, filters, and column removals are pushed to the top of the query steps. Also, avoid dynamic column renaming or merging inside Power Query unless necessary.

3. Disable Auto Date/Time for new columns: in Power BI Desktop → Options → Global → Data Load, disable "Auto Date/Time" for new date columns. This avoids unnecessary hidden date tables being created behind the scenes, which can slow things down.

4. Since you're already using Python, consider preprocessing the data: remove unnecessary rows/columns, sort by date or ID, and save the final processed version locally. Power BI loads faster when less transformation is needed after import.
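
As a rough illustration of step 1 (and the preprocessing in step 4), a conversion like the one below could be added to the existing Python export script. This is a minimal sketch assuming pandas and pyarrow are installed; the file and column names are placeholders, not the report's actual ones.

# Minimal sketch: convert the exported CSV to Parquet before Power BI imports it.
# Assumes pandas + pyarrow; file and column names are placeholders.
import pandas as pd

df = pd.read_csv("stock_transactions.csv", parse_dates=["Date"])

# Optional preprocessing (step 4): downcast wide float columns to shrink the file.
float_cols = df.select_dtypes(include="float64").columns
df[float_cols] = df[float_cols].astype("float32")

# Columnar, compressed output is typically much faster to import than CSV.
df.to_parquet("stock_transactions.parquet", engine="pyarrow", index=False)

The resulting file can then be imported in Power BI Desktop through the Parquet connector instead of Text/CSV.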

Please continue using Microsoft Community Forum.

If you found this post helpful, please consider marking it as "Accept as Solution" and give it a 'Kudos' to help others find it more easily.

Regards,
Pavan.

Jeanxyz
Impactful Individual

For your information, I have exported the data as partitioned Parquet files. It still takes a long time to load into Power BI; I can't tell the exact time because the load has been running for 17 minutes and still hasn't finished. According to the test below, a Parquet file loads about 50% faster than CSV, which doesn't really solve my problem.

Anonymous
Not applicable

Hi @Jeanxyz,

Thank you for reaching out in Microsoft Community Forum.

Please follow the steps below to optimize the Power BI refresh:

1. Break the 700k-row table into monthly or yearly partitions and load only what's needed using filters or relationships.

2. Set up RangeStart and RangeEnd parameters in Power Query and define date-based filtering so that only new data is refreshed.

3. Remove unused columns, reduce decimal precision, and simplify datetime fields to basic formats to shrink the load time.

4. Load the preprocessed data into a lightweight database such as SQL Server Express or SQLite, then import or DirectQuery it from Power BI for faster reads (a sketch follows this list).
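
For step 4, a minimal sketch of the SQLite route might look like the following, assuming pandas is available; the database, table, and column names are placeholders rather than anything from the original report.

# Hedged sketch: load the preprocessed CSV into a local SQLite database once,
# then point Power BI at the database instead of the raw CSV. Names are placeholders.
import sqlite3
import pandas as pd

df = pd.read_csv("stock_transactions.csv", parse_dates=["Date"])

with sqlite3.connect("stocks.db") as conn:
    df.to_sql("fact_stocks", conn, if_exists="replace", index=False)
    # An index on the usual filter columns keeps date-ranged reads fast.
    conn.execute(
        "CREATE INDEX IF NOT EXISTS ix_stock_date ON fact_stocks (Stock, Date)"
    )

Power BI has no built-in SQLite connector, so the connection would typically go through an ODBC driver for SQLite.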

Please continue using Microsoft Community Forum.

If you found this post helpful, please consider marking it as "Accept as Solution" and give it a 'Kudos' to help others find it more easily.

Regards,
Pavan.

lbendlin
Super User

The short answer is Parquet. The long answer is rather surprising. 

 

 
