Crow2525
Helper I

Speeding up filtering of large datasets

Often, I'll start with a significant dataset and filter it or cut out columns to achieve the refined dataset for importing. However, after sitting and waiting for ages for each further slight adjustment to be made, I realised that I would be better off 'drawing a line in the sand': saving the reduced dataset as an Excel workbook and then pointing Power Query at that. The downside is that if I need to provide an updated version of the whole chain, I have to go back to those original Excel workbooks and update them manually.

 

The effect of this approach is that I end up with 3-4 different save points of Excel workbook datasets that combine, but are not live data. I know that I'm using PQ incorrectly, and the documentation points me in the direction of dataflows, but I feel that isn't what I'm after either.

 

In short, is there a way to achieve the above using the built-in tools?

 

 

13 REPLIES
v-stephen-msft
Community Support

Hi @Crow2525 ,

 

I think you still need to start from the data source and optimize it, so that filtering becomes faster.

The fastest way to optimize your Power BI report is to limit the columns in your data model to only the ones you need. Go through your tables in Power Query and determine which fields are actually used; delete any columns that are not used in any of your reports or calculations.
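For example, a minimal sketch (with hypothetical file, sheet, and column names) of dropping unused columns as early as possible:

let
    // Hypothetical source; substitute your own connector and path.
    Source = Excel.Workbook(File.Contents("C:\Data\Policies.xlsx"), null, true),
    Data = Source{[Item = "data", Kind = "Sheet"]}[Data],
    Promoted = Table.PromoteHeaders(Data, [PromoteAllScalars = true]),
    // Keep only the fields the model actually uses, so every later step
    // works on a narrower table.
    Kept = Table.SelectColumns(Promoted, {"Policy Number", "Renewal Date", "Premium"})
in
    Kept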

 

Best Regards,

Stephen Tao

 

If this post helps, then please consider accepting it as the solution to help other members find it more quickly.

watkinnc
Super User

To get a distinct buffered table column instead of a list:

 

= Table.Buffer(Table.Distinct(getlist[[Policy Number]]))
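If you need that keyed table for a join, a rough sketch (PolicyKeys as above; #"Renewals" is a hypothetical table being filtered):

let
    PolicyKeys = Table.Buffer(Table.Distinct(getlist[[Policy Number]])),
    // An inner join keeps only rows whose Policy Number appears in PolicyKeys;
    // the nested join column is dropped afterwards.
    Joined = Table.NestedJoin(#"Renewals", {"Policy Number"}, PolicyKeys, {"Policy Number"}, "Keys", JoinKind.Inner),
    Result = Table.RemoveColumns(Joined, {"Keys"})
in
    Result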

 

--Nate


I’m usually answering from my phone, which means the results are visualized only in my mind. You’ll need to use my answer to know that it works—but it will work!!
watkinnc
Super User

Reference queries DO cause the original query to evaluate, in case the data has changed since you referenced it. I would move that work from the first query to the subsequent queries. Meanwhile, to get your distinct list of Policy Numbers, add a new query named PolicyNumbers, and make the source:

 

= List.Buffer(List.Distinct(getlist[Policy Number]))

 

 Now you have a buffered list of distinct Policy numbers that you can use either as a list in selecting rows (each List.Contains...) or you can make it into a table column to use for joins.
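A minimal sketch of the row-selection option (PolicyNumbers is the buffered list query above; #"Renewals" stands in for whatever table you are filtering):

let
    Source = #"Renewals",   // hypothetical: any table with a Policy Number column
    // Because PolicyNumbers is already buffered, the list is evaluated once
    // rather than re-enumerated for every row.
    Filtered = Table.SelectRows(Source, each List.Contains(PolicyNumbers, [Policy Number]))
in
    Filtered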

 

I am here to tell you that these techniques will speed your queries right on up.

 

--Nate

 

 



Wow, thank you! I was trying an each List.Contains filter based on a distinct list and it was hanging. I replaced the list source with this buffered list code and my filter step resolved almost instantly!

 

I'm learning PBI and have had a few instances where I tried several ways to do something before finding a significantly smoother option. This is the most frustrating one so far, so it's a big relief to find such a succinct solution. But it's also frustrating that I didn't find it sooner.

Yes indeed, always use each List.Contains(List.Buffer(List), [Column])--especially if you want your SQL to fold--it adds a WHERE COLUMN IN (List Values).
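As a sketch of that pattern against a relational source (hypothetical server, database, and table names), where the filter can fold as described:

let
    Source = Sql.Database("myserver", "mydb"),
    Renewals = Source{[Schema = "dbo", Item = "Renewals"]}[Data],
    // With a buffered list, this step can fold into the source query as
    // WHERE [Policy Number] IN (...), so the filtering happens on the server.
    Filtered = Table.SelectRows(Renewals, each List.Contains(List.Buffer(PolicyNumbers), [Policy Number]))
in
    Filtered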

 

---Nate


Crow2525
Helper I

1 - getList
let
    // "Folder" is a text parameter holding the workbook's folder path/URL.
    Source = Excel.Workbook(Web.Contents(Folder & "Consolidated List.xlsm"), null, true),
    // Keep only the workbook objects named "data".
    #"SelectData" = Table.SelectRows(Source, each ([Name] = "data")),
    // Drill into the table behind the defined name "data".
    data_DefinedName = #"SelectData"{[Item="data",Kind="DefinedName"]}[Data],
    #"Promoted Headers" = Table.PromoteHeaders(data_DefinedName, [PromoteAllScalars=true]),
    // De-duplicate on the key column.
    #"Removed Duplicates" = Table.Distinct(#"Promoted Headers", {"Policy Number"}),
    Custom1 = #"Removed Duplicates"
in
    Custom1
2 – Renewal List 2022
let
    Source = #"1 - getList",
    #"▶ Add  Allocations" = Source,

Nothing fancy, just grabbing data from a 35MB file with multiple sheets and data tabs. The above is how I have traditionally done this: split the query up and then reference the first query from the second. The issue I have is that if I add an additional step, e.g. a filter, Power Query appears to start right back at the first referenced query, re-download the 35MB file, and store it. If I put a line in the sand, save that query's output as a separate Excel document, and then reference that Excel document in the second query, I get a faster working file.

let
    Source = Excel.Workbook(File.Contents("C:\Users\OneDrive\Projects\Renewal List\Working\1 - getList.xlsx"), null, true),

 

Can I achieve the same thing using incremental refresh?

Hi @Crow2525 ,

 

I still have not understood what you are trying to do by putting a line in the sand.

My suggestion was to put the filter on your 1st query, do whatever needs to be done, and then remove that filter. It's best to filter the query as far up as possible, so that the steps that follow run faster and query folding can take place as well.

 

Reference queries do not redownload; that is what referencing is meant for.

If you were creating a duplicate of the first query, then it would have redownloaded the data.

 

Incremental refresh works on dates and updates only the data that falls between the specified dates, while keeping the rest of the data static.
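For reference, incremental refresh requires the query to be filtered by the reserved RangeStart and RangeEnd datetime parameters; a minimal sketch (the date column name is hypothetical):

let
    Source = #"1 - getList",
    // Power BI substitutes RangeStart/RangeEnd per partition and refreshes
    // only the partitions whose data falls in the changed range.
    Filtered = Table.SelectRows(Source, each [Renewal Date] >= RangeStart and [Renewal Date] < RangeEnd)
in
    Filtered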

 

What is the source of your original data?

Also, 35MB is not a very large dataset, and PBI should be able to easily compress it to about a quarter of its size if the right data types are defined, eventually making the refreshes faster.

Hope this helps.

 

Thank you,

Vishesh Jain

Did I answer your question?
If yes, then please mark my post as a solution!


Proud to be a Super User!



No, reference queries DO redownload. It performs the whole refresh on the referenced query, and only then starts working on the downstream query. If you load a file as a data source, make some transforms, and then make a reference query, the whole original query will refresh before anything happens with the new query. Sometimes it's better to just duplicate the query or the data source, so that one query calls one file once. If you have four queries referencing one query, that one query will refresh four separate times--once for each of the four queries that reference it. A reference query is basically all the steps of the first query, plus the steps in the new query.

https://learn.microsoft.com/en-us/power-bi/guidance/power-query-referenced-queries

 

--Nate


I’m usually answering from my phone, which means the results are visualized only in my mind. You’ll need to use my answer to know that it works—but it will work!!

Mmm, that's interesting. I do amend the data types, but I've done it in the second query.

 

My code appears to load the first query and then the second. Are you sure that I'm referencing correctly? The documentation on referencing also says that it does load both queries.

 

"Let's be clear about what this means: When a query references a second query, it's as though the steps in the second query are combined with, and run before, the steps in the first query."

Referencing Power Query queries - Power BI | Microsoft Docs

Sorry, I didn't answer your question re the line in the sand. I think we're on the same page.

 

When I reference the first query, i.e. :

let
    Source = #"1 - getList",
    #"▶ Add  Allocations" = Source,

Any further amendments to this query will result in the query re-loading the first query's data source.

 

Perhaps my terminology is wrong, which is probably confusing the issue.

I'm trying to achieve what a dataflow does, without having to use the web service:

 

"You can design the dataflow to encapsulate the source data and transformations. As the dataflow is a persisted store of data in the Power BI service, its data retrieval is fast. So, even when referencing queries result in multiple requests for the dataflow, data refresh times can be improved."

visheshjain
Solution Supplier

Hi @Crow2525,

 

I have not fully understood the problem, but from what I understand you can look into incremental refresh, which will only refresh a portion of your fact table.

 

Alternatively, if you want to work on a reduced dataset, then as soon as the data is imported you can add a step to keep just the top 1000 rows or so, and once you are done with the query transformations, remove that step to get the full data.
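For instance, a minimal sketch of such a temporary step (the row count is arbitrary):

let
    Source = #"1 - getList",
    // Temporary while authoring: work against a small preview, then delete
    // this step before the final refresh to load the full dataset.
    Preview = Table.FirstN(Source, 1000)
in
    Preview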

 

Please correct me if my understanding of the problem is wrong.

 

Hope this helps.

 

Thank you,

Vishesh Jain




watkinnc
Super User

Power Query will always do exactly what you want it to do, so long as you whisper the right words into the query. Can you post some of the code before the "line in the sand", please?

 

--Nate


