How can I get Synapse Notebook to only process new...

HamidBee · ‎07-25-2023

I have data that I plan on uploading to an Azure storage account. My plan is to create a pipeline in Synapse Studio, which will include an Apache Notebook (Using PySpark). The primary objective is to have the Notebook process the data and then save it to a lake database.

The data will be uploaded to an Azure storage container following this format for example: 2022/Week1/Week1.xlsx and 2023/Week10/Week10.xlsx. Initially, I will store and process all historical data in the storage account. After that, the data will be processed and added to the lake database on a weekly basis. Now, the question is, what is the most efficient method to enable the Azure pipeline or the Notebook to identify and process only the newly added data?.

HimanshuS-msft · ‎07-30-2023

Hello @HamidBee
Thanks for using the Fabric community.
I believe you will have to use the below function in the notebook to get the weeknumber dynamically every week .

weekofyear(df.colname)

Thanks
HImanshu