The ultimate Microsoft Fabric, Power BI, Azure AI, and SQL learning event: Join us in Stockholm, September 24-27, 2024.
Save €200 with code MSCUST on top of early bird pricing!
Find everything you need to get certified on Fabric—skills challenges, live sessions, exam prep, role guidance, and more. Get started
Hello,
I have 2 folders containing each similar Excel files where one file is added each day on each folder.
I have 2 original queries that stack up the content of all the excels in each folder.
Then I have other queries that REFERENCE these original queries ("reference", in order to prevent PowerBI from doing te same calculation over and over for each subsequent queries, and thus saving time if I am not wrong) and apply some specific modifications on them in order to display different charts on my dashboard.
I have intermediate queries where I deactivate the loading because I don't use them in any chart on my dashboard.
All is working fine but one query is taking a very long time whenever I refresh. This specific query, in one of the transformation step, combines an "each" with another intermediate query. I am trying to understand why it takes so long. Should I activate the loading of all the queries (does it change anything ?) ? I mean if a query I eventually want to load is based on other queries, should those absolutely be loaded too to save time or it's got nothing to do ? Is there a tool in PowerBI I can use to spot where the bottleneck is ?
I have the impression that on the "each" step, everything is calculated again, whereas I REFERENCED previous queries, so for me once they have been calculated once, they shouldn't be take so long to be used against afterwards. At the end of the steps of the intermediates queries, the dataset is only like 30 lines (compares to severaly thousands originally in all my excel files), so I don't get what can cause the slowness.
This is the slow query (step #"Personnalisée ajoutée1" is combining "each" with another intermediate query) :
let
Source = SANSDUO,
#"Lignes groupées" = Table.Group(Source, {"ExtractionDate"}, {{"Nombre", each Table.RowCount(_), Int64.Type}}),
#"Lignes filtrées" = Table.SelectRows(#"Lignes groupées", each [ExtractionDate] > #date(2022, 11, 22)),
#"Lignes triées" = Table.Sort(#"Lignes filtrées",{{"ExtractionDate", Order.Ascending}}),
#"Index ajouté" = Table.AddIndexColumn(#"Lignes triées", "Index", 0, 1, Int64.Type),
#"Personnalisée ajoutée" = Table.AddColumn(#"Index ajouté", "DifferenceNbLignesSansDuoDepuisDernierExtract", each if [Index] > 0 then [Nombre] - #"Index ajouté"{[Index=[Index]-1]}[Nombre] else 0),
#"Colonnes renommées" = Table.RenameColumns(#"Personnalisée ajoutée",{{"Nombre", "NbLignesSansDUO"}}),
#"Personnalisée ajoutée1" = Table.AddColumn(#"Colonnes renommées", "NbLignesAvecDUO", each #"AVECDUO_BIS"{[Index=[Index]]}[Nombre]),
#"Premières lignes supprimées" = Table.Skip(#"Personnalisée ajoutée1",1),
#"Colonnes supprimées" = Table.RemoveColumns(#"Premières lignes supprimées",{"Index"}),
#"Type modifié" = Table.TransformColumnTypes(#"Colonnes supprimées",{{"NbLignesSansDUO", Int64.Type}, {"DifferenceNbLignesSansDuoDepuisDernierExtract", Int64.Type}})
in
#"Type modifié"
Solved! Go to Solution.
Hi @Anonymous ,
My two-pence:
" ("reference", in order to prevent PowerBI from doing te same calculation over and over for each subsequent queries, and thus saving time if I am not wrong) "
- You are wrong. Referencing a query dos not prevent Power Query from hitting the source again. There are a few edge cases where PQ will actually cache query results but it's so infrequent that you should just assume a source hit for every query, referenced or not.
" Should I activate the loading of all the queries (does it change anything ?) "
- No. If they aren't used in the front-end report, enabling the load will do nothing but take up space in your model and file size. It will not speed anything up regarding your PQ transformations.
" I have the impression that on the "each" step, everything is calculated again "
- Yes, the query that are are referencing is indeed being 'loaded' (scanned) in its entirety again in order to satisfy the matching requirements in your code. If you want to speed this up, you can buffer the query that you're referencing into memory first, so PQ doesn't need to scan all the rows remotely. In your example, it might look something like this:
bufferedRefTable = Table.Buffer(#"AVECDUO_BIS"),
#"Personnalisée ajoutée1" = Table.AddColumn(#"Colonnes renommées", "NbLignesAvecDUO", each bufferedRefTable{[Index=[Index]]}[Nombre]),
" I don't get what can cause the slowness. "
- All of the above.
Pete
Proud to be a Datanaut!
Hi @Anonymous ,
My two-pence:
" ("reference", in order to prevent PowerBI from doing te same calculation over and over for each subsequent queries, and thus saving time if I am not wrong) "
- You are wrong. Referencing a query dos not prevent Power Query from hitting the source again. There are a few edge cases where PQ will actually cache query results but it's so infrequent that you should just assume a source hit for every query, referenced or not.
" Should I activate the loading of all the queries (does it change anything ?) "
- No. If they aren't used in the front-end report, enabling the load will do nothing but take up space in your model and file size. It will not speed anything up regarding your PQ transformations.
" I have the impression that on the "each" step, everything is calculated again "
- Yes, the query that are are referencing is indeed being 'loaded' (scanned) in its entirety again in order to satisfy the matching requirements in your code. If you want to speed this up, you can buffer the query that you're referencing into memory first, so PQ doesn't need to scan all the rows remotely. In your example, it might look something like this:
bufferedRefTable = Table.Buffer(#"AVECDUO_BIS"),
#"Personnalisée ajoutée1" = Table.AddColumn(#"Colonnes renommées", "NbLignesAvecDUO", each bufferedRefTable{[Index=[Index]]}[Nombre]),
" I don't get what can cause the slowness. "
- All of the above.
Pete
Proud to be a Datanaut!
Thanks for your answer.
"Referencing a query dos not prevent Power Query from hitting the source again"
--> What is the difference between referencing a query and duplicating it then please ?
When I deactivate the loading of the query, it also greys out the element just below : "include in the report refresh". What would be the effect if I would uncheck this rather than unchecking "activate the loading" ? (Sorry if I don't have the exact terms because my Power BI desktop installed is not in english..)
I am applying your recommended code with Table.Buffer(#"AVECDUO_BIS") right now, it is still running (and not finished yet) so it doesn't seem like it hepled (much). As I said, I thought the bottleneck was there, but how can I be sure ? When my query is loading, I just see popup "Load" with a number of Mb and my excel files names that keep looping (they are read several times, thus the long time).
Also, when using Table.Buffer, how can I be sure that I won't miss elements and that in the end all my data will be used ? Can we know what Table.Buffer specificly does ?
Is it worth and riskless (I mean by that, that in the end, I want to use all the latest data), to add Table.Buffer a bit everywhere on my intermediate queries as well (within the query SANSDUO, which is the source of my problematic query, for instance ) ?
Hi @Anonymous ,
" What is the difference between referencing a query and duplicating it then please ? "
- It just means you can reuse code and 'fork' queries. For example, you could import and perform a lot of generic transformations to get the data how you want it, then reference it five times, once for each department, without having to maintain five sets of the generic code if you make a change to it (obviously not a practical example, but you get the picture).
" "include in the report refresh". What would be the effect if I would uncheck this ? "
- This is used during development so you can apply changes to your model, even when some other queries have errors or development changes that you don't want sent to the model yet. It doesn't stop a query refreshing once sent to the Service.
" I am applying your recommended code... I thought the bottleneck was there, but how can I be sure ? "
- It's difficult to be sure, but there's a number of considerations that you should make while writing your code in order to optimise it. See below for more details.
" When my query is loading, I just see popup "Load" with a number of Mb and my excel files names that keep looping (they are read several times, thus the long time). "
- This means that Power Query is having to scan your Excel files multiple times for whatever reason. Functions like merge and your #"Personnalisée ajoutée1" step require the target table to be scanned all over again in order for matches to be made.
" when using Table.Buffer, how can I be sure that I won't miss elements and that in the end all my data will be used ? Can we know what Table.Buffer specificly does ? "
- Table/List.Buffer just loads the entire target into memory for later use, instead of re-scanning the target every time some info is needed from it. For example: Imagine there is a reference book on the internet that you often need to get information from, and you have a very slow internet connection. Each time you need info you can open your browser, navigate to the web page and wait for it to load, choose the book and wait for it to load, find the page and wait for it to load, get your info then close it all down again until the next time... or, you can just download the book to your hard drive - this is what the buffer functions essentially do.
" Is it worth and riskless (I mean by that, that in the end, I want to use all the latest data), to add Table.Buffer a bit everywhere on my intermediate queries as well "
- It's worth it if they're added in the correct places. It's NOT riskless. Anything you buffer is stored in RAM so, if you decide to buffer a 3 million row table with 100 columns for no reason, this is going to use RAM that should otherwise be used for performing transformations and will ultimately have a very detrimental effect on your performance.
Here's a list of the top things ( in my experience) that negatively affect Power Query refresh performance:
Multiple source hits - unnecessary duplication or referencing of a source to create variants of tables that should probably be kept in one table and distinguished using DAX.
Whole table operations - these functions require PQ to put an entire table into memory to perform:
- Sort
- Merge
- Pivot
- Group
- Probably many more less-frequently used
Multiple scan operations - these require PQ to keep loading a target query to get the info it needs:
- Merge
- Custom calculations e.g. creating cumulative sums in PQ without knowing the 'correct' way.
- Other custom references, such as your #"Personnalisée ajoutée1" step.
In terms of your query, I can see a few of these already. You need to evaluate first whether all of these are absolutely necessary to perform, then whether they are in the optimal position in your query. For example, the first step you perform is Group By, then you filter on the grouped dates. You could optimise this a bit by filtering on the dates first, then Group By has fewer rows it needs to load to memory to perform. This is just one small example, but gives an idea about the thought process of optimisation.
Overall, it does become a bit of an art, with there not always being a right and wrong answer, but experience comes quickly.
Pete
Proud to be a Datanaut!
Thank you again for the clarifications.
I thought duplication prevented the process of recalculating everything, so it is very good to know that Buffers actually does that.
After waiting for the query to finish loading, I noticed there was actually a real amelioration in terms of time.
I am also doing what you pointed in the things that negatively affect Power Query refresh performance : create several queries, based on the same source, that perform several kinds of grouping for instance : I understand, there would be a better performance if I would do those transformations in DAX rather that in M, because the source, that consists of stacking up my excel files in my example, wouldn't be calculated several times to actually do the same thing, right ?
Is there a way to get a (graphical ?) view of what the queries actually do behind the scene in order to spot when an action is done several times for nothing for instance ? That would have allow me to notice that duplicating a query doesn't prevent from the calculation to be done twice for instance. Or at least a way to know how long each step take ?
Is it only by experience you can tell that filtering + grouping is more efficient than grouping + filtering for instance, or we can find this information somewhere or it is just logic ? Because I thought that filtering first needs to load in memory all the lines too, just like grouping needs, no ?
" there would be a better performance if I would do those transformations in DAX rather that in M..., right ? "
- Yes, but you're not going to actually do the transformations in DAX permanently. You just materialise the transformations temporarily when you need them using measures.
For example, looking at your original example query, it looks like you're trying to build a table that shows the movement in the count of [Nombre] each day. This could be visualised in DAX with a fairly standard measure and a disconnected calendar table, assuming that your original table has a start and an end date per [Nombre]:
_noofNombreOverTime =
VAR __cDate = MAX(calendar[date])
RETURN
CALCULATE(
DISTINCTCOUNT(yourTable[Nombre]),
FILTER(
yourTable,
yourTable[DateStarted] <= __cDate
&& ( yourTable[DateEnded] > __cDate || ISBLANK(yourTable[DateEnded]) )
)
)
This will visualise the change in [Nombre] count when applied against a Calendar[Date] axis, and will do it almost instantly, without the need for any additional query time during refresh.
NOTE: If you're appending multiple Excel files from folders, then this would definitely be done in Power Query. This type of thing is exactly what PQ is for. It's just the subsequent differentiation of data into separate queries, and specific calculations, that should be left to DAX.
" Is there a way to get a (graphical ?) view of what the queries actually do behind the scene in order to spot when an action is done several times for nothing for instance? "
- You can analyse your queries using the OTB Query Analyser on the Tools Tab in PQ:
https://learn.microsoft.com/en-us/power-query/query-diagnostics
However, this is honestly unnecessary in almost all circumstances once you use both PQ and DAX to their strengths.
" Is it only by experience you can tell that filtering + grouping is more efficient than grouping + filtering for instance, or we can find this information somewhere or it is just logic? "
- Maybe a bit of each. I know Group By is a whole-table operation, so logic tells me that reducing the number of rows before performing it makes sense. I don't think Table.SelectRows (filter) is a whole-table operation but, even if it is, then experience tells me that it's faster to do it in the Filter>Group order than the other way around for whatever reason.
If you want to really understand how/why things are happening in M, you can start with the language specification:
https://learn.microsoft.com/en-us/powerquery-m/power-query-m-language-specification
After that, it's largely about using logic, experimenting, and reading blogs/articles etc. One of the best blogs to really understand M optimisation (amongst many other things) is Chris Webb's over at CrossJoin:
Pete
Proud to be a Datanaut!
Hi, List/Table/Binary.Buffer didn't loose your data, they just place it in RAM.
Try this method to reference next/previous row:
let Source = Table.FromColumns({{"a","b","c","d"}},{"col"}), col = List.Buffer({null}&Source[col]), AddedIndex = Table.AddIndexColumn(Source, "ind", 0, 1), TransformToPrevRow = Table.TransformColumns(AddedIndex, {{"ind", each col{_}}}) in TransformToPrevRow
Join the community in Stockholm for expert Microsoft Fabric learning including a very exciting keynote from Arun Ulag, Corporate Vice President, Azure Data.
Check out the August 2024 Power BI update to learn about new features.