[Fabric] How can I load incremental data from a Po...

luis_quiroz · ‎07-10-2024

I am currently working on loading a large database from PostgreSQL to several Lakehouses within a workspace. This is part of implementing the medallion lakehouse architecture, specifically the first phase: loading data from PostgreSQL to Bronze.

During this process, I have encountered an issue where the data loading is not incremental, even though I have configured the process to use the "update" method with "append". For instance, when loading a table of 656.178 MB, every run in Data Factory has similar metrics (same duration, data read, data written). The query used by Data Factory is a "SELECT * FROM table" on every run, leading to the entire dataset being loaded each time.

I am seeking a method to load data incrementally, where only new and updated data is loaded, rather than the entire dataset each time. This would significantly optimize the data loading process.

One potential solution I am considering is connecting to the Write-Ahead Log (WAL) from PostgreSQL. This approach would enable identifying new and updated data by tracking changes in the WAL, thus allowing for incremental data loading.

Optimizing data loading processes is crucial for efficient data management in a lakehouse architecture.

Run result 1

Run result 2

The Question

However, I am unsure how to implement this in Fabric. I would greatly appreciate any guidance or insights on the following:

How can I connect to the WAL in PostgreSQL for incremental data loading?
Are there specific tools or configurations within Fabric that support this method?
Any best practices or tips for implementing incremental data loading in this context?

Thank you in advance for your help!

LuitwielerMSFT · ‎07-11-2024

Hi @luis_quiroz , to add to the post of @v-nuoc-msft;

I am a PM on the Data Factory team and focus on Dataflows Gen2.

Within Dataflows Gen2 we will be providing a native way to do incremental refresh, public preview is planned for later this year. In the mean time, we have an other solution that may work for you.

Have you looked into this solution? Pattern to incrementally amass data with Dataflow Gen2 - Microsoft Fabric | Microsoft Learn This solution is designed to check your data destination for the latest value for a specific column and leverage that as a filter on the source to trickle feed changes from your source into the destination, is that something that would help you forward?

luis_quiroz · ‎07-11-2024

Thank you for your response. We appreciate the information provided. We implemented the append method (Pattern to incrementally amass data with Dataflow Gen2 - Microsoft Fabric | Microsoft), but we did not see any difference compared to the replace method when bringing in data from PostgreSQL.

v-nuoc-msft · ‎07-10-2024

Hi @luis_quiroz

In the following links, you can learn how to incrementally load data from data Warehouse to Lakehouse in fabric.

Here are the important steps to create this solution:

Select the watermark column. Select a column in the source data table that can be used to slice new or updated records for each run.

Prepare a table to store the last watermark value in the data warehouse.

Create pipelines using workflows.

You can view the link below for more details:

Incrementally load data from Data Warehouse to Lakehouse - Microsoft Fabric | Microsoft Learn

I hope you found this helpful.

Regards,

Nono Chen

If this post helps, then please consider Accept it as the solution to help the other members find it more quickly.

jhgomez · ‎07-11-2024

Is there an incremental option that does not require adding a date field to each table? For example, similar to how incremental replication schemes work in services like DMS. The current solution implies that I must ensure each object has date fields or records the last update. This is something that PostgreSQL can handle by consuming the WAL, which truly guarantees the recording of database changes and replicates them integrally.

Under the current scheme, if I have a 100GB table that takes 3 hours to transfer all the data, each time I want to update or refresh the data, I will have to wait that amount of time because it performs a full refresh. This is not optimal operationally in terms of timing with the users.

luis_quiroz · ‎07-11-2024

Thank you for the information. Unfortunately, it is not possible for us to implement this solution because several of our tables do not have a watermark. What we have found works best for loading data from PostgreSQL is using the WAL.

LuitwielerMSFT · ‎07-11-2024

Not really an expert on the WAL of postgres, but would it make sense to try and leverage a native query within dataflow to the Postgresql WAL to discover what has changed and use that as an filter on the source data in combination with a append only data destination?

[Fabric] How can I load incremental data from a PostgreSQL database using Dataflows or Data Factory?

The Question

Helpful resources

Join us at the Microsoft Fabric Community Conference

We want your feedback!

Microsoft Fabric Community Conference 2025

A Year in Review - December 2024

Join us at the 2025 Microsoft Fabric Community Conference

[Fabric] How can I load incremental data from a PostgreSQL database using Dataflows or Data Factory?

The Question

Helpful resources

Join us at the Microsoft Fabric Community Conference

We want your feedback!

Microsoft Fabric Community Conference 2025

A Year in Review - December 2024