Solved: Re: Problem with Table.Distinct

cseu · ‎08-26-2025

Hello,

I'm trying to create a dimension table of clients in Power BI using Power Query. The goal is to have one row per email, keeping the most recent record based on the Date column.

Here is the query I'm using:
let
Source = sqlquery,
Personnalisé1 = Table.TransformColumns(Source, {{"Email", each Text.Clean(Text.Trim(_))}}),
#"Lignes triées" = Table.Sort(Personnalisé1,{{"Date", Order.Descending}, {"Email", Order.Ascending}}),
Personnalisé2 = Table.Buffer(#"Lignes triées"),
#"Doublons supprimés" = Table.Distinct(Personnalisé2, {"Email"}),
#"Colonnes supprimées" = Table.RemoveColumns(#"Doublons supprimés",{"Date"})
in
#"Colonnes supprimées"

In Power Query, everything works as expected: I only see one row per email, and it keeps the most recent record after sorting by Date.

However, once the data is loaded into the Power BI model, I still get duplicate emails. The Email values are exactly the same (I’ve applied both Text.Trim and Text.Clean), but the other columns sometimes differ. Despite that, I expected only one row per email because I used Table.Distinct on the Email column.

Because of this, I can't create a one-to-many relationship between this dimension table and the fact table (relationship stays many-to-many due to the duplicates).

Any idea why this happens?

Thanks in advance for any help!

DataNinja777 · ‎08-26-2025

Hi @cseu ,

This is a classic and subtle Power Query problem, and the cause is almost certainly Query Folding. Your logic is sound, but the Power Query engine's optimization is getting in the way. It tries to translate all your steps into a single, efficient native query (SQL, in this case) to send to your data source.

The issue arises because the combination of Table.Sort followed by Table.Distinct doesn't translate well into SQL. When you click "Close & Apply," Power Query generates a SQL query that likely uses SELECT DISTINCT [Email], .... However, a SQL DISTINCT operation doesn't guarantee which row's data will be kept for the other columns when duplicate emails exist. It simply picks one, often ignoring the ORDER BY clause you specified earlier in your steps. This is why it works correctly in the Power Query editor preview, which often operates on a smaller, in-memory subset of data, but fails during the full refresh where query folding takes over.

To solve this, you must introduce a step that Power Query cannot fold, forcing it to perform the operation in its own engine after the data is sorted and loaded. The most robust method is to use Table.Group. This pattern involves grouping by the unique key, which creates nested tables for each email, and then simply taking the first row from each nested table. Since you already sorted by date descending, that first row will always be the most recent record.

Here is the more reliable M-code to achieve your goal:

let
    Source = sqlquery,
    // Clean the email text first
    #"Cleaned Email" = Table.TransformColumns(Source, {{"Email", each Text.Clean(Text.Trim(_)), type text}}),

    // Sort the entire table so the newest record for each email is at the top
    #"Sorted Rows" = Table.Sort(#"Cleaned Email", {{"Date", Order.Descending}}),

    // Group by Email and keep all rows for each email in a nested table
    #"Grouped Rows" = Table.Group(#"Sorted Rows", {"Email"}, {{"AllData", each _, type table}}),

    // Add a new column that extracts only the first row (the most recent) from each nested table
    #"Added Custom" = Table.AddColumn(#"Grouped Rows", "MostRecentRecord", each Table.FirstN([AllData], 1)),

    // Remove the column with the nested tables, we don't need it anymore
    #"Removed Other Columns" = Table.SelectColumns(#"Added Custom", {"MostRecentRecord"}),

    // Expand the column containing the single-row tables to get your final columns
    #"Expanded MostRecentRecord" = Table.ExpandTableColumn(#"Removed Other Columns", "MostRecentRecord",
        // Replace with your actual column names from the original table
        {"Email", "Date", "Column3", "Column4"},
        // Rename them if you want, otherwise use the same names
        {"Email", "Date", "Column3", "Column4"}
    ),
    
    // Finally, remove the date column if you don't need it in the dimension
    #"Removed Date Column" = Table.RemoveColumns(#"Expanded MostRecentRecord",{"Date"})

in
    #"Removed Date Column"

This approach works because the Table.Group operation, when configured to return all rows (each _) in a nested table, is too complex to be folded into a standard SQL statement. This forces Power Query to execute the sort and then perform the grouping in its own engine. From there, Table.FirstN safely selects the correct record from each group, and expanding the result gives you the clean, unique dimension table you need.

Best regards,

View solution in original post

nathancwatkins · ‎08-26-2025

You are saying "Email". Do you mean "email address"? If so, are there multiple recipients per email?

--Nate

cseu · ‎08-27-2025

Hi @nathancwatkins ,

Yes, I meant "email address". In the original SQL data, there can be multiple rows with the same email address (for example, a client updating their info over time). That’s why I’m trying to keep only the most recent record per unique email in Power Query.

Thanks,

DataNinja777 · ‎08-26-2025

Hi @cseu ,

This is a classic and subtle Power Query problem, and the cause is almost certainly Query Folding. Your logic is sound, but the Power Query engine's optimization is getting in the way. It tries to translate all your steps into a single, efficient native query (SQL, in this case) to send to your data source.

The issue arises because the combination of Table.Sort followed by Table.Distinct doesn't translate well into SQL. When you click "Close & Apply," Power Query generates a SQL query that likely uses SELECT DISTINCT [Email], .... However, a SQL DISTINCT operation doesn't guarantee which row's data will be kept for the other columns when duplicate emails exist. It simply picks one, often ignoring the ORDER BY clause you specified earlier in your steps. This is why it works correctly in the Power Query editor preview, which often operates on a smaller, in-memory subset of data, but fails during the full refresh where query folding takes over.

To solve this, you must introduce a step that Power Query cannot fold, forcing it to perform the operation in its own engine after the data is sorted and loaded. The most robust method is to use Table.Group. This pattern involves grouping by the unique key, which creates nested tables for each email, and then simply taking the first row from each nested table. Since you already sorted by date descending, that first row will always be the most recent record.

Here is the more reliable M-code to achieve your goal:

let
    Source = sqlquery,
    // Clean the email text first
    #"Cleaned Email" = Table.TransformColumns(Source, {{"Email", each Text.Clean(Text.Trim(_)), type text}}),

    // Sort the entire table so the newest record for each email is at the top
    #"Sorted Rows" = Table.Sort(#"Cleaned Email", {{"Date", Order.Descending}}),

    // Group by Email and keep all rows for each email in a nested table
    #"Grouped Rows" = Table.Group(#"Sorted Rows", {"Email"}, {{"AllData", each _, type table}}),

    // Add a new column that extracts only the first row (the most recent) from each nested table
    #"Added Custom" = Table.AddColumn(#"Grouped Rows", "MostRecentRecord", each Table.FirstN([AllData], 1)),

    // Remove the column with the nested tables, we don't need it anymore
    #"Removed Other Columns" = Table.SelectColumns(#"Added Custom", {"MostRecentRecord"}),

    // Expand the column containing the single-row tables to get your final columns
    #"Expanded MostRecentRecord" = Table.ExpandTableColumn(#"Removed Other Columns", "MostRecentRecord",
        // Replace with your actual column names from the original table
        {"Email", "Date", "Column3", "Column4"},
        // Rename them if you want, otherwise use the same names
        {"Email", "Date", "Column3", "Column4"}
    ),
    
    // Finally, remove the date column if you don't need it in the dimension
    #"Removed Date Column" = Table.RemoveColumns(#"Expanded MostRecentRecord",{"Date"})

in
    #"Removed Date Column"

This approach works because the Table.Group operation, when configured to return all rows (each _) in a nested table, is too complex to be folded into a standard SQL statement. This forces Power Query to execute the sort and then perform the grouping in its own engine. From there, Table.FirstN safely selects the correct record from each group, and expanding the result gives you the clean, unique dimension table you need.

Best regards,

BA_Pete · ‎08-26-2025

Hi @cseu ,

@DataNinja777 has provided a great answer here with an excellent explanation. The only thing I want to add is that I would recommend using Table.Max instead of Table.FirstN here, using "Date" as the value to evaluate Table.Max over, i.e. each Table.Max([AllData], "Date"). This should ensure you only get the latest record, and don't have to mess around running the sorting/folding gauntlet.

Pete

Now accepting Kudos! If my post helped you, why not give it a thumbs-up?

Proud to be a Datanaut!

cseu · ‎08-27-2025

Hi @BA_Pete ,

Thanks a lot for the follow-up and your suggestion to use Table.Max instead of Table.FirstN.

I’ve tested the Table.Max method on the grouped table as you recommended, but unfortunately I’m still seeing duplicate emails in the data model after the query is loaded even though the preview in Power Query looks correct.

Thanks again for your help

BA_Pete · ‎08-27-2025

In that case, I would suggest that your email addresses are not identical.

I would start by using the Trim and Clean functions on your email column BEFORE your Group By step and see if this helps.

Also remember that Power Query M is entirely case sensitive as well, so pete@here.com is seen as distinct from Pete@here.com and pEtE@here.com, so you could also try applying Text.Lower to your email column, again, BEFORE the Group By step.

Pete

Now accepting Kudos! If my post helped you, why not give it a thumbs-up?

Proud to be a Datanaut!

v-priyankata · ‎09-11-2025

Hi @cseu

Thank you for reaching out to the Microsoft Fabric Forum Community.
@DataNinja777 @BA_Pete @nathancwatkins Thanks for the inputs.

I hope the information provided by users was helpful. can you please try those points, If you still have questions, please don't hesitate to reach out to the community.

v-priyankata · ‎09-15-2025

Hi @cseu

I wanted to check if you had the opportunity to review the information provided by user. Please feel free to contact us if you have any further questions.

cseu · ‎08-26-2025

Thanks for the detailed explanation and the suggested solution.

Unfortunately, I’ve tried implementing the Table.Group + Table.FirstN approach exactly as described, but I’m still seeing duplicate emails in the data model once the query is loaded. I also added calculated columns to check string length and the Unicode of the last character. Everything seems identical at the email level.

This makes me think the issue might come from differences in other columns, but even with the grouping logic keeping only the most recent row per email, duplicates are still present in the model, and the relationship stays many-to-many.

Let me know if there’s anything else I should try.

Problem with Table.Distinct

Helpful resources

Power BI Monthly Update - October 2025

FabCon Atlanta 2026

Fabric Data Days starts November 4th!

Problem with Table.Distinct

Helpful resources

Power BI Monthly Update - October 2025

FabCon Atlanta 2026