tan_thiamhuat
Helper V

Optimize the Dataflow processes

tan_thiamhuat_0-1746498760029.png

I have the above data pipeline, in which the Copy SFTP Data and Delete SFTP Data activities both work fine.

But when it comes to the Dataflows for the Dimension and Fact tables, I am concerned, because the source data will keep growing as more CSV files come in. The problem is that computing the Dimension and Fact tables depends on the full source data itself.

The table below shows some statistics on the number of rows in the source data and in the corresponding Dimension and Fact tables.

tan_thiamhuat_1-1746499097918.png

As the number of CSV files increases from 2 to 3, the Fact table computation goes from 2 minutes to 9 minutes to complete. So I am concerned that this computation time will keep accumulating as the number of files increases.

What other alternatives are there? The main issue, I think, is that both the Dimension and Fact table computations depend on the full source data. Thanks.

1 ACCEPTED SOLUTION
v-mdharahman
Community Support

Hi @tan_thiamhuat,

Thanks for reaching out to the Microsoft Fabric community forum.

You're absolutely right to be concerned about the performance implications as more CSV files come in. From what you've described and based on the row counts and increasing duration, it looks like both your Dimension and Fact table Dataflows are recomputing everything from scratch using the full set of source data each time. That’s likely why the Fact table load time jumped from 2 minutes to 9 minutes when you added the third file.

Since your source (SFTP) data keeps growing and is a key input to the transformations, this kind of cumulative processing can lead to scaling issues. Even though it's technically working now, over time it'll get slower and potentially more error-prone.

One way to address this is by introducing incremental processing. Instead of processing all rows every time, you can track which files or records have already been processed either by using file metadata (like file name or timestamp), maintaining a simple "processed files" log table, or applying row-level logic based on a timestamp or unique identifier. This way, your Dataflows only process new or changed data.
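As a rough illustration (not a drop-in solution), the "processed files" log idea could look something like the Python sketch below. The folder and file names are just placeholders I made up, and the same logic can be expressed as a Dataflow Gen2 filter against a small log table instead.

```python
# Minimal sketch of a "processed files" log; SOURCE_DIR and LOG_PATH are
# hypothetical locations, not part of your existing pipeline.
import os
import pandas as pd

SOURCE_DIR = "/data/sftp_landing"            # folder where the CSVs land
LOG_PATH = "/data/processed_files_log.csv"   # log of files already handled

# 1. Load the names of files that have already been processed.
if os.path.exists(LOG_PATH):
    processed = set(pd.read_csv(LOG_PATH)["file_name"])
else:
    processed = set()

# 2. Keep only the CSV files that are not yet in the log.
new_files = [f for f in sorted(os.listdir(SOURCE_DIR))
             if f.endswith(".csv") and f not in processed]

# 3. Read and transform just the new files.
for file_name in new_files:
    df = pd.read_csv(os.path.join(SOURCE_DIR, file_name))
    # ... append/merge df into the Dimension and Fact tables here ...

# 4. Record the newly processed files so the next run skips them.
pd.DataFrame({"file_name": sorted(processed | set(new_files))}) \
    .to_csv(LOG_PATH, index=False)
```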

Another approach is to land your raw data into a Delta table in the Lakehouse and use something like a Notebook or Dataflow Gen2 to incrementally merge new data into your Dimension and Fact layers. That gives you more control over what gets processed and can help avoid reprocessing the full dataset every run.
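To make the merge idea concrete, here is a hedged PySpark sketch of what that notebook step might look like; the table and key names (staging_sales, fact_sales, order_id) are purely illustrative.

```python
# Sketch of an incremental MERGE from a notebook into a Fact Delta table;
# staging_sales, fact_sales and order_id are assumed names for illustration.
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()  # already provided in a Fabric notebook

# Rows landed since the last run (e.g. the latest CSV batch staged as a table).
new_rows = spark.read.table("staging_sales")

fact = DeltaTable.forName(spark, "fact_sales")

(fact.alias("t")
     .merge(new_rows.alias("s"), "t.order_id = s.order_id")
     .whenMatchedUpdateAll()      # refresh rows that changed
     .whenNotMatchedInsertAll()   # insert rows not seen before
     .execute())
```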

Also, if your source files have some kind of date or sequence indicator, partitioning the data on ingestion and only querying recent partitions during transformation can really help reduce load times.
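For example, a partition-aware ingest might look roughly like this; the landing path, the raw_sales table, and the load_date column are assumptions on my side, not something in your current pipeline.

```python
# Sketch of date-partitioned ingestion so downstream transformations
# only scan recent partitions; names and paths are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()  # already provided in a Fabric notebook

# Ingest the latest CSVs and stamp each row with its load date.
raw = (spark.read.option("header", True)
            .csv("Files/sftp_landing/*.csv")
            .withColumn("load_date", F.current_date()))

# Append into a Delta table partitioned by load_date.
(raw.write.mode("append")
     .format("delta")
     .partitionBy("load_date")
     .saveAsTable("raw_sales"))

# Downstream, filter on the partition column so only today's partition is read.
recent = (spark.read.table("raw_sales")
               .where(F.col("load_date") == F.current_date()))
```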

 

If I've misunderstood your needs or you still have problems with it, please feel free to let us know.

Best Regards,
Hammad.
Community Support Team

 

If this post helps, please mark it as a solution so that other members can find it more quickly.

Thank you.


REPLIES
tan_thiamhuat
Helper V

you can track which files or records have already been processed either by using file metadata (like file name or timestamp), maintaining a simple "processed files" log table, or applying row-level logic based on a timestamp or unique identifier. This way, your Dataflows only process new or changed data. --> Can this be achieved via Dataflow alone, without a Notebook?

tan_thiamhuat
Helper V

tan_thiamhuat_1-1746541858237.png

I have updated the data pipeline as above, and it is more optimized now. Using an added column named Processed, the Dimension and Fact tables are computed only from rows where Processed = 0. After processing, the Processed column is updated to 1 by the Stored Procedure. In this way, we do not load the entire source data to compute the Dimension and Fact tables.
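For reference, the pattern can be sketched in plain Python against Azure SQL as below; the table and procedure names (dbo.SourceData, dbo.usp_MarkProcessed) are only placeholders, and in the actual pipeline these steps are a Dataflow Gen2 and a Stored Procedure activity.

```python
# Illustrative pyodbc version of the Processed-flag pattern described above;
# server, table and procedure names are made up for the example.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=yourserver.database.windows.net;DATABASE=yourdb;"
    "UID=youruser;PWD=yourpassword;"
)
cursor = conn.cursor()

# 1. Pull only the rows that have not been processed yet.
new_rows = cursor.execute(
    "SELECT * FROM dbo.SourceData WHERE Processed = 0"
).fetchall()

# 2. Compute/refresh the Dimension and Fact tables from this delta only
#    (in the actual pipeline this is the Dataflow Gen2 step).

# 3. Mark the rows as processed so the next run skips them; in the pipeline
#    this is the Stored Procedure activity (sets Processed = 1).
cursor.execute("EXEC dbo.usp_MarkProcessed")
conn.commit()
conn.close()
```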

Thanks, Hammad, for your different suggestions; I will definitely explore them.

Another approach is to land your raw data into a Delta table in the Lakehouse and use something like a Notebook or Dataflow Gen2 to incrementally merge new data into your Dimension and Fact layers. --> At present, we want to restrict ourselves to Dataflow Gen2 and Azure SQL Server only, excluding the Lakehouse and Notebooks, as we want to reduce storage costs and minimize the use of Notebooks for maintainability.

 

I have tried using a Notebook with a Stored Procedure, and it seems to satisfy my requirements, but we want to reduce the use of Notebooks for maintainability.
