Hi, I am new to Data Factory and need help with the following use case:
The simplified version is that I have two tables in a lakehouse: [table A], where weekly data is staged, and [table B], the main table to which the weekly data needs to be appended.
I need a way to only append Table A's data to B if the date in A is later than the latest date in B.
Can someone please advise what would be the easiest way to achieve this? Do I need a pipeline/dataflow/notebook?
Data in Table A can arrive at any point during the week. A pipeline is scheduled to bring in the weekly data file and overwrite Table A whenever the file arrives. This bit is done. I now need to check the date and append it to the main table if needed.
Thank you!
I guess this is a similar scenario: https://learn.microsoft.com/en-us/fabric/data-factory/tutorial-setup-incremental-refresh-with-datafl...
If you choose to follow this tutorial, you will probably need to tweak it a bit to suit your case.
Hi @Learning_Fabric ,
Your solution is great, @frithjof_v. It worked like a charm! I have another idea in mind that I would like to share for reference.
I think when you use a pipeline, you can change the join type to suit your needs.
For the other details, you can refer to what @frithjof_v said.
Best Regards
Yilong Zhou
If this post helps, then please consider accepting it as the solution to help the other members find it more quickly.
Thank you for pointing me to this solution! It was exactly what I needed! Worked perfectly for my use case 🙂
I think you can achieve this with any of the tools you mentioned: a notebook, a data pipeline, or a Dataflow Gen2.
You need to schedule it to run perhaps once each day.
You need to query both table A and table B.
PS! You should not query the SQL analytics endpoint, because it can have some delay. Query the Lakehouse tables / OneLake directly to get the current data.
Then you compare the max value in the Date column in each table.
If the max date in A is later than the max date in B, then you append A into B.
Otherwise do nothing.
The easiest way to implement this depends on your current skill set. It can easily be done in Dataflow Gen2; that may not be the most performant option and it may use more CUs on your capacity, but I think it should be easy to build there.
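If you prefer the notebook route, here is a minimal PySpark sketch of the logic described above. The table names (table_a, table_b) and the Date column are assumptions for illustration; adjust them to match your lakehouse.

```python
from pyspark.sql import functions as F

# Assumed names -- placeholders for illustration; adjust to your lakehouse.
staging_table = "table_a"   # weekly staged data
main_table = "table_b"      # main table to append into
date_col = "Date"

# Read both Delta tables directly from the lakehouse
# (a Fabric notebook provides the `spark` session by default).
df_a = spark.read.table(staging_table)
df_b = spark.read.table(main_table)

# Find the latest date in each table.
max_a = df_a.agg(F.max(date_col)).collect()[0][0]
max_b = df_b.agg(F.max(date_col)).collect()[0][0]

# Append only when the staged data is newer than anything already in the main table.
# (If the main table is empty, max_b is None and we append.)
if max_a is not None and (max_b is None or max_a > max_b):
    df_a.write.mode("append").saveAsTable(main_table)
    print(f"Appended rows from {staging_table} to {main_table}.")
else:
    print("No new data to append.")
```

You could then schedule this notebook (directly or from a data pipeline) to run once a day, as mentioned above.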