
dpc_development
Helper III

Power Query: Merging on virtual table from starting CSV takes too long and repeats code execution

I have a starting CSV that contains a SELECT DISTINCT of all the dimension columns of my transaction data. The columns are hierarchical, so I am trying to split that further into three tables based on column distribution and frequency of reporting level required.

 

My code is similar to the following: I create an index on the distinct combinations of the first set of columns and merge it back into the original master. After repeating this twice for two other sets of columns, I merge the original master ID to the transaction table's master ID, yielding three dimension foreign keys for a better-defined star schema.

 

let
    Source = Folder.Files(#"Folder Path"),
    ReadCSV = Table.AddColumn(Source, "CSV", each
                    Table.PromoteHeaders(
                                            Csv.Document([Content], [Delimiter=",", Encoding=1252, QuoteStyle=QuoteStyle.None]), [PromoteAllScalars=true]
                                        )
                ),
    KeepCSV = Table.SelectColumns(ReadCSV,{"CSV"}),
    ExpandedCSV = Table.ExpandTableColumn(
                            KeepCSV, "CSV",
                            {"list", "of", "columns"},
                            {"List", "Of", "Columns"}
                        ),
    ChangedType = Table.TransformColumnTypes(ExpandedCSV,{{"List", type text}, {"Of", type text}, {"Columns", type text}}),
	
	/* The EventScenarios - ExpandedES steps are repeated similarly two more times, with other column sets */
    EventScenarios = let
                        ESSource = Table.Group(ChangedType, {"Event", "Group", "Scenario"}, {{"ValueColumn", each List.Max([ValueColumn])}}),
                        ESSorted = Table.Sort(ESSource, {{"Event", Order.Ascending}}),
                        ESFinal = Table.AddIndexColumn(ESSorted, "Id", 1, 1, Int64.Type)
                     in
                        ESFinal,
    MergedES = Table.NestedJoin(ChangedType, {"Event", "Group", "Scenario"}, EventScenarios, {"Event", "Group", "Scenario"}, "Event Scenarios", JoinKind.LeftOuter),
    ExpandedES = Table.ExpandTableColumn(MergedES, "Event Scenarios", {"Id"}, {"Event Scenario Id"})
in
    ExpandedES

 

The code above works. However, Power Query doesn't appear to cache the earlier merge results or the inner generation of distinct rows; each downstream step seems to re-evaluate them.

 

The original file is only 65 MB (uncompressed), but the loaded data grows to around 570 MB.

 

Since the dimensions are hierarchical and each subsequent SELECT DISTINCT includes the previously generated master table's ID, Power Query appears to re-process the previously generated tables every time. I tried wrapping them in Table.Buffer, but that only seemed to make things worse.
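For what it's worth, the usual guidance is to buffer only the small distinct table, not the large source, so each of the three merges reads the buffered copy instead of re-evaluating the grouping chain. A hedged sketch against the snippet above (same placeholder column names; assuming the dimension table fits comfortably in memory):

```
// Sketch: buffer the small dimension table once so later merges against
// ChangedType reuse it instead of recomputing the distinct rows each time.
EventScenarios = Table.Buffer(
    Table.AddIndexColumn(
        Table.Sort(
            Table.Distinct(Table.SelectColumns(ChangedType, {"Event", "Group", "Scenario"})),
            {{"Event", Order.Ascending}}
        ),
        "Id", 1, 1, Int64.Type
    )
),
MergedES = Table.NestedJoin(ChangedType, {"Event", "Group", "Scenario"},
                            EventScenarios, {"Event", "Group", "Scenario"},
                            "Event Scenarios", JoinKind.LeftOuter)
```

Buffering ChangedType itself (the large table) is what tends to make things worse, since the whole expanded dataset is then pinned in memory for every evaluation.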

 

Can someone suggest how I might improve the loading process?

6 REPLIES
Anonymous
Not applicable

Hi @dpc_development ,

 

Power Query automatically detects what connector to use based on the first file found in the list. 

After selecting Transform data in the Combine files dialog box, you'll be taken back to the Power Query Editor in the query that you initially created from the connection to the local folder. The output query now contains the source file name in the left-most column, along with the data from each of the source files in the remaining columns.

Please review the entire process according to this document to see if it can improve your situation.

Combine CSV files 

 

Hope it helps,


Community Support Team _ Caitlyn

If this post helps, then please consider accepting it as the solution to help other members find it more quickly.

 

 

lbendlin
Super User

Do you have to do the merge in Power Query or could the data model do the work for you?

 

There are some really good articles on Table.NestedJoin and its pitfalls.
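One commonly cited pitfall is that Table.NestedJoin followed by Table.ExpandTableColumn carries a nested table column through the query, which can hurt performance, whereas Table.Join produces a flat result and accepts an algorithm hint. A hedged sketch using the same placeholder names as the question (Table.Join keeps all columns from both inputs, so shared names must be renamed first):

```
// Sketch: flat join instead of NestedJoin + ExpandTableColumn.
// Renames avoid column-name conflicts between the two inputs.
ESRenamed  = Table.RenameColumns(EventScenarios,
                 {{"Event", "ES Event"}, {"Group", "ES Group"}, {"Scenario", "ES Scenario"}}),
MergedES   = Table.Join(ChangedType, {"Event", "Group", "Scenario"},
                        ESRenamed, {"ES Event", "ES Group", "ES Scenario"},
                        JoinKind.LeftOuter, JoinAlgorithm.LeftHash),
ExpandedES = Table.RemoveColumns(MergedES, {"ES Event", "ES Group", "ES Scenario"})
```

Whether LeftHash (or another JoinAlgorithm option) helps depends on the relative sizes of the two tables, so it is worth testing against the default.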

@lbendlin I would prefer it happen in Power Query itself. When you say Data Model, do you mean in the primary data source location? I do not have access to that, only the denormalised CSV output.

No, I mean the Power BI side: after you load the queries, you can relate the tables in the data model view.

Not sure how I'd use a join in the data model, though. I am basically normalising the flattened CSV output: taking a distinct of some table columns into a separate table, generating IDs, and replacing those columns in the flattened table with the generated ID.

I assume the Join needs to be done in Power Query itself, in this scenario.

" I am basically normalising the flattened CSV output"

 

Yes, I understand what you are trying to do. But do you really have to? Are you familiar with the concept of expanded tables?  

 

How bad is your memory pressure?  Maybe you have a solution that is looking for a problem. What would happen if you consume the CSV as is, and then let the data model do the work (even if the table is denormalized, in-memory hierarchies will be fast).
