Hi all,
I set up a weekly scheduled data pipeline which copies data from Azure Data Lake Storage Gen2 (ADLS Gen2) to the Fabric Lakehouse.
It works perfectly: it takes all files added last week and moves them to the Lakehouse. If a file is already in the Lakehouse, it is simply overwritten, which is fine.
However, I am struggling to find the best practice for moving the data from these files into a delta table, as this can be done in several ways (a notebook, dataflows, copy activity). It is important that data is only appended if it is not already in the delta table.
What is the best practice way to do this? It is strange that I am not able to find much information about this, as it must be one of the most common scenarios in Fabric.
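For reference, the behaviour I'm after in a notebook would look roughly like the sketch below, where the path, table name, and key column are just placeholders:

```python
from delta.tables import DeltaTable

# Read the files the pipeline copied into the Lakehouse Files area this week
# (path, table name, and key column "id" are placeholders)
new_df = spark.read.parquet("Files/landing/weekly/")

target = DeltaTable.forName(spark, "my_table")

# Insert only the rows whose key is not already in the delta table;
# existing rows are left untouched
(
    target.alias("t")
    .merge(new_df.alias("s"), "t.id = s.id")
    .whenNotMatchedInsertAll()
    .execute()
)
```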
We do this in three ways:
1) if we have local copies of the files, we have a folder structure that has an unprocessed folder and a processed folder. A pipeline with a Get Metadata activity on the unprocessed folder feeding a For Each activity. Inside that is a Copy Data activity and some file copy/delete steps. (No Move File activity 😞 )
2) if all we have is a shortcut and we don't want to copy the files, we maintain a list of processed files in a table (using a notebook). In a second notebook, we do a left anti join of the files in the shortcut against the processed list and output the list of unprocessed files. This then feeds a For Each activity to do the Copy Data, and afterwards the first notebook appends the newly processed file names to the table (see the sketch after this list).
3) If we use a shortcut *and* keep a local copy of the processed files, then we can do something like 2), just substituting a directory listing for the processed-file table.
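A rough sketch of the second notebook in 2), with placeholder paths and table/column names:

```python
from notebookutils import mssparkutils

# List the files currently visible through the shortcut (path is a placeholder)
files_df = spark.createDataFrame(
    [(f.name,) for f in mssparkutils.fs.ls("Files/adls_shortcut")],
    "file_name string",
)

# Tracking table of already-processed files, maintained by the first notebook
processed_df = spark.read.table("processed_files")

# Left anti join keeps only the file names not in the processed list
unprocessed_df = files_df.join(processed_df, on="file_name", how="left_anti")

# Return the list to the pipeline so a For Each activity can drive the Copy Data
mssparkutils.notebook.exit(",".join(r.file_name for r in unprocessed_df.collect()))
```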
If this helps, please consider Accepting as a solution to help other people find it more easily.
When you move files into the Lakehouse, are they not automatically written into Parquet/Delta format?