Register now to learn Fabric in free live sessions led by the best Microsoft experts. From Apr 16 to May 9, in English and Spanish.
Context: I have gaps in my master data - where I have cost centers in my fact table that are not present in my master data for particular fiscal year - but those cost centers are present in other years. I have developed a scenario to allow to complete the master data - by appending the GetMissingDim to the Dim Table.
When I apply transformation steps (add_dim1FilteredFinal) to the dim.1 table (i.e. the table generated from the merge query with the Dim Query) - the process is extremely slow - it appears to read in data on a row by row (record by record basis) - without or with buffering. Using the code below (GetMissingDim - Alternative 1) - a count of 10K or above will demonstrate the slowness of this approach.
When I apply the same transformation steps (add_dim1FilteredFinal) to the dim.1 table - except that I first expand the dim.1 table and then regroup using all rows, the data does not appear to be read row by row - and as such, the results are extremely fast (GetMissingDim - Alternative 2) - even with a count of 50K. .
My questions:
Why is there such a difference between a table that has been generated from a merger versus a table that has been generated from the allrows in the group by function- as in both cases, I am doing transformation to an inner table?
How can I change Alternative 1 to have performance results like Alternative 2 - so I don't have to expand and re-group?
Below you will find the sample queries that would demonstrate the issues described above:
Count: A seperate query to enter a numeric value to change how big are the tables. Issues start to be apparent with a count of 10K+
DIM: Dimension table with a missing year of data.
let
x=Count,
Source = Table.FromRows(List.Zip({{1..x},List.Repeat({"2018-19"},x)})&List.Zip({{1..x},List.Repeat({"2020-21"},x)}),{"dimCC","dimFY"})
in
Source
GetMissingDim- Alternative 1 - Very slow as it processes row by row
let
x=Count,
Source = Table.FromRows(List.Zip({{1..x},List.Repeat({"2018-19"},x)})&List.Zip({{1..x},List.Repeat({"2019-20"},x)})&List.Zip({{1..x},List.Repeat({"2020-21"},x)}),{"CC","FY"}),
MergeLeftAnti = Table.NestedJoin(Source, {"CC", "FY"}, Dim, {"dimCC", "dimFY"}, "Dim", JoinKind.LeftAnti),
Merged = Table.NestedJoin(MergeLeftAnti, {"CC"}, Dim, {"dimCC"}, "dim.1", JoinKind.LeftOuter),
Add_dim1FilteredFinal = Table.AddColumn(Merged, "dim.1_Final", each let
FYFilter=[FY],
dimFYFiltered=List.Select([dim.1][dimFY], each _>FYFilter),
dimFY_Max=List.Min(dimFYFiltered),
dim1FilteredFinal=Table.SelectRows([dim.1],each [dimFY]=dimFY_Max)
in dim1FilteredFinal),
#"Expanded dim.1_Final" = Table.ExpandTableColumn(Add_dim1FilteredFinal, "dim.1_Final", {"dimCC", "dimFY"}, {"dimCC", "dimFY"})
in
#"Expanded dim.1_Final"
GetMissingDim - Alternative 2- Very quick - Same as Alternative 1 except for a expand Merged step and a Group with all rows:
let
x=Count,
Source = Table.FromRows(List.Zip({{1..x},List.Repeat({"2018-19"},x)})&List.Zip({{1..x},List.Repeat({"2019-20"},x)})&List.Zip({{1..x},List.Repeat({"2020-21"},x)}),{"CC","FY"}),
MergeLeftAnti = Table.NestedJoin(Source, {"CC", "FY"}, Dim, {"dimCC", "dimFY"}, "Dim", JoinKind.LeftAnti),
Merged = Table.NestedJoin(MergeLeftAnti, {"CC"}, Dim, {"dimCC"}, "dim.1", JoinKind.LeftOuter),
#"Expanded dim.1" = Table.ExpandTableColumn(Merged, "dim.1", {"dimCC", "dimFY"}, {"dimCC", "dimFY"}),
#"Grouped Rows" = Table.Group(#"Expanded dim.1", {"CC", "FY"}, {{"dim.1", each _, type table [CC=number, FY=text, Dim=table, dim.1=table]}}),
Add_dim1FilteredFinal = Table.AddColumn(#"Grouped Rows", "dim.1_Final", each let
FYFilter=[FY],
dimFYFiltered=List.Select([dim.1][dimFY], each _>FYFilter),
dimFY_Max=List.Min(dimFYFiltered),
dim1FilteredFinal=Table.SelectRows([dim.1],each [dimFY]=dimFY_Max)
in dim1FilteredFinal),
#"Expanded dim.1_Final" = Table.ExpandTableColumn(Add_dim1FilteredFinal, "dim.1_Final", {"dimCC", "dimFY"}, {"dimCC", "dimFY"})
in
#"Expanded dim.1_Final"
Solved! Go to Solution.
You don't need to do any appending. You can merge and fill up instead.
let
x=Count,
Source = Table.FromRows(List.Zip({{1..x},List.Repeat({"2018-19"},x)})&List.Zip({{1..x},List.Repeat({"2019-20"},x)})&List.Zip({{1..x},List.Repeat({"2020-21"},x)}),{"CC","FY"}),
#"Merged Queries" = Table.NestedJoin(Source, {"CC", "FY"}, Dim_ForSort, {"CC", "FY"}, "Dim_ForSort", JoinKind.LeftOuter),
#"Expanded Dim_ForSort" = Table.ExpandTableColumn(#"Merged Queries", "Dim_ForSort", {"Dimension Details"}, {"Dimension Details"}),
#"Sorted Rows" = Table.Buffer(Table.Sort(#"Expanded Dim_ForSort",{{"CC", Order.Ascending}, {"FY", Order.Ascending}})),
#"Filled Up" = Table.FillUp(#"Sorted Rows",{"Dimension Details"})
in
#"Filled Up"
This is an interesting question and I'm curious what the reason is too. My hunch is that it's related to how the internal optimizer builds the query where the slower version doesn't get vectorized optimally (or maybe isn't memoized optimally).
I'd guess there's an approach that's better than either version stated, but I don't fully understand the reasoning behind the transformations in GetMissingDim, so it's hard to suggest an alternative.
Hi Alexis,
In essence - I am doing a vlookup true and find the closest match greater than what is in the dimension table. The intent of the getmissingdim - is that I would append that table to the dimension table to get a complete dim table with all gaps filled.
I understand that there are other ways to do vlookup true (i.e. append then sort then fill (up or down depending on the desired results). Side note, this has similar performance than the above alternative 2 (I tested it this morning (below - I've pasted the code) - but likely a better way to do vlookup true.
But I am still curious to find out why the perfomance difference between alternative 1 and 2 - as I have other scenarios where an append would not be the right approach... interesting your hunch re vectorized vs memorized aspects - I will wait to see if there are other opinions - but that may be the best explanation!
Dim_ForSort
let
x=Count,
Source = Table.FromRows(List.Zip({{1..x},List.Repeat({"2018-19"},x),List.Repeat({"Dim detailsY1"},x)})&List.Zip({{1..x},List.Repeat({"2020-21"},x),List.Repeat({"Dim detailsY3"},x)}),{"CC","FY","Dimension Details"}),
Add_Flag = Table.AddColumn(Source, "flag", each "x")
in
Add_Flag
MissingDim_Append
let
//[
x=Count,
Source = Table.FromRows(List.Zip({{1..x},List.Repeat({"2018-19"},x)})&List.Zip({{1..x},List.Repeat({"2019-20"},x)})&List.Zip({{1..x},List.Repeat({"2020-21"},x)}),{"CC","FY"}),
Append_DimForSort = Table.Combine({Source, Dim_ForSort}),
BuffTable = Table.Buffer(Append_DimForSort),
SortedRows = Table.Sort(BuffTable,{{"CC", Order.Ascending}, {"FY", Order.Ascending}, {"Dimension Details", Order.Ascending}}),
FilledUp = Table.FillUp(SortedRows,{"Dimension Details"}),
FilterOutFlagX = Table.SelectRows(FilledUp, each ([flag] = null))
in
FilterOutFlagX
You don't need to do any appending. You can merge and fill up instead.
let
x=Count,
Source = Table.FromRows(List.Zip({{1..x},List.Repeat({"2018-19"},x)})&List.Zip({{1..x},List.Repeat({"2019-20"},x)})&List.Zip({{1..x},List.Repeat({"2020-21"},x)}),{"CC","FY"}),
#"Merged Queries" = Table.NestedJoin(Source, {"CC", "FY"}, Dim_ForSort, {"CC", "FY"}, "Dim_ForSort", JoinKind.LeftOuter),
#"Expanded Dim_ForSort" = Table.ExpandTableColumn(#"Merged Queries", "Dim_ForSort", {"Dimension Details"}, {"Dimension Details"}),
#"Sorted Rows" = Table.Buffer(Table.Sort(#"Expanded Dim_ForSort",{{"CC", Order.Ascending}, {"FY", Order.Ascending}})),
#"Filled Up" = Table.FillUp(#"Sorted Rows",{"Dimension Details"})
in
#"Filled Up"
Thanks - that works - I was working based on a different scenario where there was no matching values between the tables (hence the append) - but given that in this scenario, there are matching values - a merge is preferable.
Quick question on your use of buffer - my understanding is that buffer should be used before sorting a table to bring it into memory and ensure a proper sort - but you've included after completing the sort. Is there a reason for doing so? Is my understanding of buffer incorrect?
I want it in memory in the sorted order, so I buffer after the sort.
Super thank you.
I will accept your original explanation as the solution - as it is likely the most probable answer. Thanks for all the feedback.
Covering the world! 9:00-10:30 AM Sydney, 4:00-5:30 PM CET (Paris/Berlin), 7:00-8:30 PM Mexico City
Check out the April 2024 Power BI update to learn about new features.