tan_thiamhuat
Helper V

Optimize the Dataflow processes

tan_thiamhuat_0-1746498760029.png

I have the above data pipeline, in which the Copy SFTP Data and Delete SFTP Data activities both work fine.

But when it comes to the Dataflows for the Dimension and Fact tables, I am concerned, because the source data will keep growing as more CSV files come in. The problem is that computing the Dimension and Fact tables depends on the full source data itself.

The table below shows some statistics on the number of rows in the source data and in the corresponding Dimension and Fact tables.

tan_thiamhuat_1-1746499097918.png

As the number of CSV files increases from 2 to 3, the Fact table computation goes from 2 minutes to 9 minutes to complete. So I am concerned that this computation time will keep accumulating as the number of files increases.

What other alternatives are there? The main issue, I think, is that both the Dimension and Fact table computations depend on the full source data. Thanks.

1 ACCEPTED SOLUTION
v-mdharahman
Community Support

Hi @tan_thiamhuat,

Thanks for reaching out to the Microsoft Fabric community forum.

You're absolutely right to be concerned about the performance implications as more CSV files come in. From what you've described and based on the row counts and increasing duration, it looks like both your Dimension and Fact table Dataflows are recomputing everything from scratch using the full set of source data each time. That’s likely why the Fact table load time jumped from 2 minutes to 9 minutes when you added the third file.

Since your source (SFTP) data keeps growing and is a key input to the transformations, this kind of cumulative processing can lead to scaling issues. Even though it's technically working now, over time it'll get slower and potentially more error-prone.

One way to address this is by introducing incremental processing. Instead of processing all rows every time, you can track which files or records have already been processed either by using file metadata (like file name or timestamp), maintaining a simple "processed files" log table, or applying row-level logic based on a timestamp or unique identifier. This way, your Dataflows only process new or changed data.
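As a rough illustration (not a drop-in solution), the "processed files" log idea could look something like the Python sketch below. The folder and file names are just placeholders I made up, and the same logic can be expressed as a Dataflow Gen2 filter against a small log table instead.

```python
# Minimal sketch of a "processed files" log; SOURCE_DIR and LOG_PATH are
# hypothetical locations, not part of your existing pipeline.
import os
import pandas as pd

SOURCE_DIR = "/data/sftp_landing"            # folder where the CSVs land
LOG_PATH = "/data/processed_files_log.csv"   # log of files already handled

# 1. Load the names of files that have already been processed.
if os.path.exists(LOG_PATH):
    processed = set(pd.read_csv(LOG_PATH)["file_name"])
else:
    processed = set()

# 2. Keep only the CSV files that are not yet in the log.
new_files = [f for f in sorted(os.listdir(SOURCE_DIR))
             if f.endswith(".csv") and f not in processed]

# 3. Read and transform just the new files.
for file_name in new_files:
    df = pd.read_csv(os.path.join(SOURCE_DIR, file_name))
    # ... append/merge df into the Dimension and Fact tables here ...

# 4. Record the newly processed files so the next run skips them.
pd.DataFrame({"file_name": sorted(processed | set(new_files))}) \
    .to_csv(LOG_PATH, index=False)
```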

Another approach is to land your raw data into a Delta table in the Lakehouse and use something like a Notebook or Dataflow Gen2 to incrementally merge new data into your Dimension and Fact layers. That gives you more control over what gets processed and can help avoid reprocessing the full dataset every run.
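To make the merge idea concrete, here is a hedged PySpark sketch of what that notebook step might look like; the table and key names (staging_sales, fact_sales, order_id) are purely illustrative.

```python
# Sketch of an incremental MERGE from a notebook into a Fact Delta table;
# staging_sales, fact_sales and order_id are assumed names for illustration.
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()  # already provided in a Fabric notebook

# Rows landed since the last run (e.g. the latest CSV batch staged as a table).
new_rows = spark.read.table("staging_sales")

fact = DeltaTable.forName(spark, "fact_sales")

(fact.alias("t")
     .merge(new_rows.alias("s"), "t.order_id = s.order_id")
     .whenMatchedUpdateAll()      # refresh rows that changed
     .whenNotMatchedInsertAll()   # insert rows not seen before
     .execute())
```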

Also, if your source files have some kind of date or sequence indicator, partitioning the data on ingestion and only querying recent partitions during transformation can really help reduce load times.
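For example, a partition-aware ingest might look roughly like this; the landing path, the raw_sales table, and the load_date column are assumptions on my side, not something in your current pipeline.

```python
# Sketch of date-partitioned ingestion so downstream transformations
# only scan recent partitions; names and paths are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()  # already provided in a Fabric notebook

# Ingest the latest CSVs and stamp each row with its load date.
raw = (spark.read.option("header", True)
            .csv("Files/sftp_landing/*.csv")
            .withColumn("load_date", F.current_date()))

# Append into a Delta table partitioned by load_date.
(raw.write.mode("append")
     .format("delta")
     .partitionBy("load_date")
     .saveAsTable("raw_sales"))

# Downstream, filter on the partition column so only today's partition is read.
recent = (spark.read.table("raw_sales")
               .where(F.col("load_date") == F.current_date()))
```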

 

If I've misunderstood your needs or you still have problems with it, please feel free to let us know.

Best Regards,
Hammad.
Community Support Team

 

If this post helps, please mark it as a solution so that other members can find it more quickly.

Thank you.


REPLIES
tan_thiamhuat
Helper V

you can track which files or records have already been processed either by using file metadata (like file name or timestamp), maintaining a simple "processed files" log table, or applying row-level logic based on a timestamp or unique identifier. This way, your Dataflows only process new or changed data. --> Can this be achieved via Dataflow alone, without a Notebook?

tan_thiamhuat
Helper V

tan_thiamhuat_1-1746541858237.png

I have updated the data pipeline as above, and it is more optimized now. Using an added column named Processed, the Dimension and Fact tables are computed only from rows where Processed = 0. After processing, the Processed column is updated to 1 by the Stored Procedure. In this way, we do not load the entire source data to compute the Dimension and Fact tables.
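For reference, the pattern can be sketched in plain Python against Azure SQL as below; the table and procedure names (dbo.SourceData, dbo.usp_MarkProcessed) are only placeholders, and in the actual pipeline these steps are a Dataflow Gen2 and a Stored Procedure activity.

```python
# Illustrative pyodbc version of the Processed-flag pattern described above;
# server, table and procedure names are made up for the example.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=yourserver.database.windows.net;DATABASE=yourdb;"
    "UID=youruser;PWD=yourpassword;"
)
cursor = conn.cursor()

# 1. Pull only the rows that have not been processed yet.
new_rows = cursor.execute(
    "SELECT * FROM dbo.SourceData WHERE Processed = 0"
).fetchall()

# 2. Compute/refresh the Dimension and Fact tables from this delta only
#    (in the actual pipeline this is the Dataflow Gen2 step).

# 3. Mark the rows as processed so the next run skips them; in the pipeline
#    this is the Stored Procedure activity (sets Processed = 1).
cursor.execute("EXEC dbo.usp_MarkProcessed")
conn.commit()
conn.close()
```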

Thanks, Hammad, for your different suggestions; I will definitely explore them.

Another approach is to land your raw data into a Delta table in the Lakehouse and use something like a Notebook or Dataflow Gen2 to incrementally merge new data into your Dimension and Fact layers. --> At present, we want to restrict ourselves to Dataflow Gen2 and Azure SQL Server only, excluding the Lakehouse and Notebooks, as we want to reduce storage costs and minimize the use of Notebooks for maintainability.

 

I have tried using a Notebook with a Stored Procedure, and it seems to satisfy my requirements, but we want to reduce the use of Notebooks for maintainability.
