<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic How can I get Synapse Notebook to only process newly added data? in Data Engineering</title>
    <link>https://community.fabric.microsoft.com/t5/Data-Engineering/How-can-I-get-Synapse-Notebook-to-only-process-newly-added-data/m-p/3349817#M2104</link>
    <description>&lt;DIV class=""&gt;
&lt;DIV class="" data-post-id="76766996"&gt;
&lt;DIV class="" data-value="0"&gt;&lt;SPAN&gt;&lt;SPAN&gt;I have data that I plan on uploading to an Azure storage account. My plan is to create a pipeline in Synapse Studio, which will include an Apache Notebook (Using PySpark). The primary objective is to have the Notebook process the data and then save it to a lake database.&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/DIV&gt;
&lt;DIV class="" data-value="0"&gt;&amp;nbsp;&lt;/DIV&gt;
&lt;DIV class="" data-value="0"&gt;
&lt;DIV class=""&gt;
&lt;DIV class=""&gt;
&lt;P&gt;The data will be uploaded to an Azure storage container following this layout, for example: &lt;EM&gt;2022/Week1/Week1.xlsx&lt;/EM&gt; and &lt;EM&gt;2023/Week10/Week10.xlsx&lt;/EM&gt;. Initially, I will store and process all historical data in the storage account. After that, data will be processed and added to the lake database on a weekly basis. The question is: what is the most efficient way for the Azure pipeline or the notebook to identify and process only the newly added data?&lt;/P&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;</description>
    <pubDate>Wed, 26 Jul 2023 00:37:13 GMT</pubDate>
    <dc:creator>HamidBee</dc:creator>
    <dc:date>2023-07-26T00:37:13Z</dc:date>
    <item>
      <title>How can I get Synapse Notebook to only process newly added data?</title>
      <link>https://community.fabric.microsoft.com/t5/Data-Engineering/How-can-I-get-Synapse-Notebook-to-only-process-newly-added-data/m-p/3349817#M2104</link>
      <description>&lt;DIV class=""&gt;
&lt;DIV class="" data-post-id="76766996"&gt;
&lt;DIV class="" data-value="0"&gt;&lt;SPAN&gt;&lt;SPAN&gt;I have data that I plan on uploading to an Azure storage account. My plan is to create a pipeline in Synapse Studio, which will include an Apache Notebook (Using PySpark). The primary objective is to have the Notebook process the data and then save it to a lake database.&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/DIV&gt;
&lt;DIV class="" data-value="0"&gt;&amp;nbsp;&lt;/DIV&gt;
&lt;DIV class="" data-value="0"&gt;
&lt;DIV class=""&gt;
&lt;DIV class=""&gt;
&lt;P&gt;The data will be uploaded to an Azure storage container following this layout, for example: &lt;EM&gt;2022/Week1/Week1.xlsx&lt;/EM&gt; and &lt;EM&gt;2023/Week10/Week10.xlsx&lt;/EM&gt;. Initially, I will store and process all historical data in the storage account. After that, data will be processed and added to the lake database on a weekly basis. The question is: what is the most efficient way for the Azure pipeline or the notebook to identify and process only the newly added data?&lt;/P&gt;
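One common approach (a sketch for illustration, not from the thread): since the folder layout encodes the year and week, the notebook can derive the current week's file path from the run date and read only that file on each weekly run. The storage account and container names below are placeholders, and the helper function name is hypothetical.

```python
from datetime import date

def weekly_blob_path(run_date, account="ACCOUNT", container="CONTAINER"):
    # Build the ADLS Gen2 path for the week containing run_date,
    # following the "year/WeekN/WeekN.xlsx" layout described above.
    # isocalendar() returns the ISO (year, week, weekday) triple.
    year, week, _ = run_date.isocalendar()
    return (f"abfss://{container}@{account}.dfs.core.windows.net/"
            f"{year}/Week{week}/Week{week}.xlsx")
```

A pipeline trigger could pass the run date in as a parameter, so backfilling a missed week only requires re-running with that week's date.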
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;</description>
      <pubDate>Wed, 26 Jul 2023 00:37:13 GMT</pubDate>
      <guid>https://community.fabric.microsoft.com/t5/Data-Engineering/How-can-I-get-Synapse-Notebook-to-only-process-newly-added-data/m-p/3349817#M2104</guid>
      <dc:creator>HamidBee</dc:creator>
      <dc:date>2023-07-26T00:37:13Z</dc:date>
    </item>
    <item>
      <title>Re: How can I get Synapse Notebook to only process newly added data?</title>
      <link>https://community.fabric.microsoft.com/t5/Data-Engineering/How-can-I-get-Synapse-Notebook-to-only-process-newly-added-data/m-p/3356893#M2105</link>
      <description>&lt;P&gt;Hello&amp;nbsp;&lt;a href="https://community.fabric.microsoft.com/t5/user/viewprofilepage/user-id/344683"&gt;@HamidBee&lt;/a&gt;&amp;nbsp;&lt;BR /&gt;Thanks for using the Fabric community. &lt;BR /&gt;I believe you will have to use the below function in the notebook to get the weeknumber dynamically every week .&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;A href="https://spark.apache.org/docs/2.3.0/api/python/pyspark.sql.html" target="_self"&gt;&lt;EM&gt;weekofyear(df.colname)&lt;/EM&gt;&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;BR /&gt;Thanks&lt;BR /&gt;Himanshu&lt;/P&gt;</description>
      <pubDate>Sun, 30 Jul 2023 16:32:54 GMT</pubDate>
      <guid>https://community.fabric.microsoft.com/t5/Data-Engineering/How-can-I-get-Synapse-Notebook-to-only-process-newly-added-data/m-p/3356893#M2105</guid>
      <dc:creator>HimanshuS-msft</dc:creator>
      <dc:date>2023-07-30T16:32:54Z</dc:date>
    </item>
  </channel>
</rss>

