I'm building a report that uses a big dataflow table (185,000,000 rows) in import mode.
Out of these 185,000,000 rows (roughly 20 GB), I'm using only around 2,000,000 once filtered.
I'm looking for a way to either:
- filter down to my 2,000,000 rows before loading the data, so it does not load all 20 GB of data but only a fraction of that
- partially reload my Power Query (in a perfect world: not reload the data, and only apply the steps after the one I change) so I don't have to reload the whole 20 GB of data every time I change something in the query
Is either possible? I googled both and couldn't find any solution.
Thank you PavanLalwani
Hi, @Arkhos94
If you use a database to store the data, you can limit the amount imported by applying a filter when you connect, so the condition is pushed down to the source. This method still works with a dataflow.
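A minimal M sketch of this idea, assuming a SQL source and placeholder server, database, table, and column names — the point is that Table.SelectRows placed right after the source step can fold to a WHERE clause instead of downloading everything:

```m
// Hypothetical example: filter while connecting, so Power Query can fold
// the condition back to the source instead of importing every row.
let
    Source = Sql.Database("myserver", "mydb"),                 // placeholder server/database
    Components = Source{[Schema = "dbo", Item = "Components"]}[Data],
    // Folds to a WHERE clause when the source supports query folding
    Filtered = Table.SelectRows(Components, each [Country] = "FR")
in
    Filtered
```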
If you use static files such as Excel or CSV: this approach mainly uses Table.FirstN. Excel doesn't have a query engine, so by default all workbook rows are loaded when importing.
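As a hedged sketch of the Table.FirstN approach (file path and row limit are placeholders): the full file is still read from disk, since a flat file has no query engine, but only the first N rows flow into the later steps and the model.

```m
// Hypothetical sketch: cap the rows taken from a flat file.
let
    Source = Csv.Document(File.Contents("C:\data\components.csv"), [Delimiter = ","]),
    Promoted = Table.PromoteHeaders(Source),
    // Keep only the first 2,000,000 rows (placeholder limit)
    Limited = Table.FirstN(Promoted, 2000000)
in
    Limited
```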
If this does not help, please provide more details, your desired output, and a pbix file without private information (or some sample data).
Best Regards
Jianpeng Li
If this post helps, then please consider accepting it as the solution to help other members find it more quickly.
Handling such a large dataset in Power BI can be challenging, but there are ways to optimize your dataflow to reduce loading time and memory usage. Here are solutions to focus on loading only the required 2 million rows, either by filtering before data import or managing the data loading process more efficiently.
If possible, filter the data in the dataflow before it even reaches Power BI. This way, only the necessary rows are loaded into Power BI, significantly reducing memory load.
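One way to do this is to add a filtered entity inside the dataflow itself, so only the reduced rows ever leave the service. A minimal sketch, assuming a hypothetical upstream entity name, column, and country value:

```m
// Hypothetical sketch: a new filtered entity in the dataflow, e.g. a
// computed/linked entity, so downstream reports only see the reduced rows.
let
    Source = AllVehicleComponents,                                    // placeholder upstream entity
    MyCountryOnly = Table.SelectRows(Source, each [Country] = "FR")   // assumed column and value
in
    MyCountryOnly
```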
If you cannot modify the dataflow or use DirectQuery, you can try partitioning the data in Power Query to limit the rows loaded.
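A sketch of limiting rows in Power Query during development, assuming a placeholder entity name: a row-limit value (ideally exposed as a Power Query parameter) applied right after the source step lets you iterate on later steps against a small slice, then raise or remove the limit before publishing.

```m
// Hypothetical sketch: a development-time row limit applied early in the query.
let
    RowLimit = 100000,                         // placeholder; could be a Power Query parameter
    Source = ComponentsEntity,                 // placeholder: the large dataflow entity
    Limited = Table.FirstN(Source, RowLimit)
in
    Limited
```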
Each of these methods has trade-offs, but filtering directly in the dataflow or using DirectQuery will generally be the most efficient. If the underlying data source is too slow for DirectQuery, consider Incremental Refresh to avoid loading the entire dataset repeatedly.
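The standard incremental-refresh pattern can be sketched as follows. RangeStart and RangeEnd are the reserved datetime parameters Power BI supplies; the [LoadDate] column and entity name are assumptions for illustration. With this filter in place, Power BI refreshes only the partitions whose range changed instead of reloading all rows.

```m
// Hypothetical sketch of the incremental-refresh filter using the reserved
// RangeStart/RangeEnd parameters defined in Power BI Desktop.
let
    Source = ComponentsEntity,   // placeholder: the large dataflow entity
    // [LoadDate] is an assumed datetime column used to partition the data
    Filtered = Table.SelectRows(Source, each [LoadDate] >= RangeStart and [LoadDate] < RangeEnd)
in
    Filtered
```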
If this solution brightened your path or made things easier, please consider giving kudos. Your recognition not only uplifts those who helped but inspires others to keep contributing for the good of our community!
Thanks a lot for your input.
To add some information:
The data that allows me to filter my 185 million rows down to 2 million comes from other tables.
My table contains all the components of all vehicles produced by my company (with one column for the component ID and one for the vehicle ID).
I filter down to 2 million rows by merging the table with a "billing" table (to identify the country of each vehicle and keep only my country) and a "component" table (to keep only the components related to some specific functions: tyres, cooling, ...). The billing table is part of the same dataflow as the one I have issues with; the "component" table is local. I cannot reduce the data in the dataflow itself, as it is used by other countries/for other components.
So option 1 (direct filtering in the dataflow) seems unusable, do you agree?
Option 2 seems like a good idea; I will try it and check the next option if it does not improve things enough.
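The merge-based filter described above can be sketched as two inner joins, with all table and column names as placeholders. One caveat worth noting: merging against a local table generally breaks query folding, so the full dataflow table is still read during refresh; the joins only reduce what lands in the model.

```m
// Hypothetical sketch: inner joins keep only rows matching the local
// "billing" and "component" tables, shrinking the large table before
// any further transformation steps.
let
    Source = AllVehicleComponents,        // placeholder: the 185M-row dataflow entity
    WithBilling = Table.NestedJoin(Source, {"VehicleID"}, BillingMyCountry, {"VehicleID"},
                                   "Billing", JoinKind.Inner),
    WithComponents = Table.NestedJoin(WithBilling, {"ComponentID"}, ComponentsOfInterest, {"ComponentID"},
                                      "Component", JoinKind.Inner)
in
    WithComponents
```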