I am currently working on loading a large database from PostgreSQL to several Lakehouses within a workspace. This is part of implementing the medallion lakehouse architecture, specifically the first phase: loading data from PostgreSQL to Bronze.
During this process, I have encountered an issue where the data loading is not incremental, even though I have configured the update method as "append". For instance, when loading a table of 656.178 MB, every run in Data Factory shows nearly identical metrics (same duration, data read, and data written). Data Factory issues a "SELECT * FROM table" on every run, so the entire dataset is loaded each time.
I am seeking a method to load data incrementally, where only new and updated data is loaded, rather than the entire dataset each time. This would significantly optimize the data loading process.
One potential solution I am considering is connecting to the Write-Ahead Log (WAL) from PostgreSQL. This approach would enable identifying new and updated data by tracking changes in the WAL, thus allowing for incremental data loading.
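For context, a minimal sketch of what reading the WAL via PostgreSQL's built-in logical decoding could look like. This assumes the server is configured with wal_level = logical, a role with REPLICATION privilege, and the psycopg2 driver; the slot and table names are illustrative, not from any existing setup:

```python
# Sketch: peek at logical-decoding changes in PostgreSQL (assumes wal_level = logical
# and a role with REPLICATION privilege; slot and table names are illustrative).
import psycopg2

conn = psycopg2.connect("host=myhost dbname=mydb user=repl_user password=***")
conn.autocommit = True

with conn.cursor() as cur:
    # Create a logical replication slot once, using the built-in test_decoding plugin
    # (errors if the slot already exists, so run this only on first setup).
    cur.execute(
        "SELECT pg_create_logical_replication_slot(%s, %s)",
        ("fabric_bronze_slot", "test_decoding"),
    )

    # Peek at pending changes without consuming them
    # (pg_logical_slot_get_changes would consume and advance the slot).
    cur.execute(
        "SELECT lsn, xid, data FROM pg_logical_slot_peek_changes(%s, NULL, NULL)",
        ("fabric_bronze_slot",),
    )
    for lsn, xid, data in cur.fetchall():
        # 'data' is a text description of each change, e.g.:
        #   table public.orders: UPDATE: id[integer]:42 amount[numeric]:19.99
        print(lsn, xid, data)
```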
Optimizing data loading processes is crucial for efficient data management in a lakehouse architecture.
[Screenshots: "Run result 1" and "Run result 2", showing the same duration, data read, and data written on both runs]
However, I am unsure how to implement this in Fabric, and I would greatly appreciate any guidance or insights on how to achieve it.
Thank you in advance for your help!
Hi @luis_quiroz, to add to the post from @v-nuoc-msft:
I am a PM on the Data Factory team and focus on Dataflows Gen2.
Within Dataflows Gen2 we will be providing a native way to do incremental refresh; public preview is planned for later this year. In the meantime, we have another solution that may work for you.
Have you looked into this solution: Pattern to incrementally amass data with Dataflow Gen2 - Microsoft Fabric | Microsoft Learn? It is designed to check your data destination for the latest value in a specific column and leverage that as a filter on the source, trickle-feeding changes from your source into the destination. Is that something that would help you move forward?
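The linked pattern is implemented in Power Query M inside Dataflows Gen2; purely as a conceptual illustration of the same "read the latest value from the destination, filter the source" idea, here is a rough equivalent as a Fabric PySpark notebook. The table, column, and connection details are assumptions, the `spark` session is the one Fabric notebooks provide, and it assumes an initial full load has already populated the destination and that a PostgreSQL JDBC driver is available:

```python
# Conceptual sketch of the "incrementally amass" pattern in a Fabric notebook
# (the linked article implements this in Power Query M; all names are assumptions).
from pyspark.sql import functions as F

DEST_TABLE = "bronze_lakehouse.orders"   # Lakehouse destination (assumed name)
WATERMARK_COL = "updated_at"             # column that can slice new/changed rows

# 1) Ask the destination for the latest value already loaded.
last_loaded = (
    spark.read.table(DEST_TABLE)
    .agg(F.max(WATERMARK_COL).alias("wm"))
    .collect()[0]["wm"]
)

# 2) Push that value down to the source as a filter, so only the delta is read.
source_query = f"(SELECT * FROM public.orders WHERE updated_at > '{last_loaded}') AS delta"
delta_df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://myhost:5432/mydb")  # assumed connection string
    .option("dbtable", source_query)
    .option("user", "reader")
    .option("password", "***")
    .load()
)

# 3) Append the delta to the destination instead of replacing it.
delta_df.write.mode("append").saveAsTable(DEST_TABLE)
```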
Thank you for your response; we appreciate the information provided. We implemented the append method (Pattern to incrementally amass data with Dataflow Gen2 - Microsoft Fabric | Microsoft Learn), but we did not see any difference compared to the replace method when bringing in data from PostgreSQL.
Hi @luis_quiroz,
In the link below, you can learn how to incrementally load data from a Data Warehouse to a Lakehouse in Fabric.
Here are the important steps to create this solution:
1. Select the watermark column: choose a column in the source table that can be used to slice the new or updated records for each run (typically a last-modified timestamp or an incrementing key).
2. Prepare a table in the data warehouse to store the last watermark value.
3. Create a pipeline that looks up the old watermark, copies only the rows with a watermark value greater than it, and then updates the stored watermark; a condensed sketch of this flow follows the link below.
You can view the link below for more details:
Incrementally load data from Data Warehouse to Lakehouse - Microsoft Fabric | Microsoft Learn
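The article implements these steps as pipeline activities (lookups for the old and new watermark, a copy activity for the slice in between, and a final activity that persists the new watermark). Purely as a hedged illustration of that control flow, here is the same logic condensed into plain Python with psycopg2; the connection strings, table, and column names are assumptions:

```python
# Condensed sketch of the watermark pattern from the linked article, expressed as
# plain Python + psycopg2 instead of pipeline activities (all names are assumptions).
import psycopg2

src = psycopg2.connect("host=src dbname=mydb user=etl password=***")
ctl = psycopg2.connect("host=dwh dbname=control user=etl password=***")

with ctl.cursor() as cur:
    # Lookup 1: the old watermark, persisted from the previous run.
    cur.execute("SELECT last_value FROM watermark WHERE table_name = %s", ("orders",))
    old_wm = cur.fetchone()[0]

with src.cursor() as cur:
    # Lookup 2: the new watermark, i.e. the current high-water mark at the source.
    cur.execute("SELECT max(updated_at) FROM orders")
    new_wm = cur.fetchone()[0]

    # Copy step: only the slice between the two watermarks.
    cur.execute(
        "SELECT * FROM orders WHERE updated_at > %s AND updated_at <= %s",
        (old_wm, new_wm),
    )
    rows = cur.fetchall()  # in the pipeline, this slice is written to the Lakehouse

with ctl.cursor() as cur:
    # Final step: persist the new watermark so the next run starts from here.
    cur.execute(
        "UPDATE watermark SET last_value = %s WHERE table_name = %s",
        (new_wm, "orders"),
    )
ctl.commit()
```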
I hope you found this helpful.
Regards,
Nono Chen
If this post helps, then please consider Accepting it as the solution to help the other members find it more quickly.
Is there an incremental option that does not require adding a date field to each table? For example, something similar to how incremental replication works in services like DMS. The current solution implies that I must ensure each object has a date field or records its last update. This is something PostgreSQL can handle by consuming the WAL, which reliably records every database change and allows them to be replicated in full.
Under the current scheme, if I have a 100 GB table that takes 3 hours to transfer in full, then every time I want to refresh the data I will have to wait that long again, because a full refresh is performed. Operationally, that timing is not acceptable for our users.
Thank you for the information. Unfortunately, we cannot implement this solution because several of our tables do not have a watermark column. What we have found works best for loading data from PostgreSQL is using the WAL.
Not really an expert on the WAL of Postgres, but would it make sense to try leveraging a native query from the dataflow against the PostgreSQL WAL to discover what has changed, and use that as a filter on the source data in combination with an append-only data destination?
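Sketching that idea, and hedging heavily: assuming the logical replication slot from the earlier sketch exists and uses the built-in test_decoding plugin, you could parse the changed keys out of the slot and use them as a filter when re-reading the source. The slot name, table, key column, and the simplistic text parsing are all assumptions:

```python
# Sketch: turn logical-decoding output into a source filter (slot name, table, and
# key column are assumptions; parsing test_decoding's text format is simplistic).
import re
import psycopg2

conn = psycopg2.connect("host=myhost dbname=mydb user=repl_user password=***")
conn.autocommit = True
changed_ids = set()

with conn.cursor() as cur:
    # Consume the pending changes for the slot created earlier (advances the slot).
    cur.execute(
        "SELECT data FROM pg_logical_slot_get_changes(%s, NULL, NULL)",
        ("fabric_bronze_slot",),
    )
    for (data,) in cur.fetchall():
        # test_decoding emits lines like:
        #   table public.orders: UPDATE: id[integer]:42 ...
        m = re.match(r"table public\.orders: (INSERT|UPDATE): id\[integer\]:(\d+)", data)
        if m:
            changed_ids.add(int(m.group(2)))

if changed_ids:
    with conn.cursor() as cur:
        # Re-read only the changed rows; these would be appended to the Bronze destination.
        cur.execute(
            "SELECT * FROM public.orders WHERE id = ANY(%s)",
            (list(changed_ids),),
        )
        changed_rows = cur.fetchall()
```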