<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic How can I get Synapse Notebook to only process newly added data? in Data Engineering</title>
    <link>https://community.fabric.microsoft.com/t5/Data-Engineering/How-can-I-get-Synapse-Notebook-to-only-process-newly-added-data/m-p/3349817#M2104</link>
    <description>&lt;DIV class=""&gt;
&lt;DIV class="" data-post-id="76766996"&gt;
&lt;DIV class="" data-value="0"&gt;&lt;SPAN&gt;&lt;SPAN&gt;I have data that I plan on uploading to an Azure storage account. My plan is to create a pipeline in Synapse Studio, which will include an Apache Notebook (Using PySpark). The primary objective is to have the Notebook process the data and then save it to a lake database.&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/DIV&gt;
&lt;DIV class="" data-value="0"&gt;&amp;nbsp;&lt;/DIV&gt;
&lt;DIV class="" data-value="0"&gt;
&lt;DIV class=""&gt;
&lt;DIV class=""&gt;
&lt;P&gt;The data will be uploaded to an Azure storage container following this layout, for example: &lt;EM&gt;2022/Week1/Week1.xlsx&lt;/EM&gt; and &lt;EM&gt;2023/Week10/Week10.xlsx&lt;/EM&gt;. Initially, I will store and process all historical data in the storage account. After that, data will be processed and added to the lake database on a weekly basis. The question is: what is the most efficient way for the Azure pipeline or the notebook to identify and process only the newly added data?&lt;/P&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;</description>
    <pubDate>Wed, 26 Jul 2023 00:37:13 GMT</pubDate>
    <dc:creator>HamidBee</dc:creator>
    <dc:date>2023-07-26T00:37:13Z</dc:date>
    <item>
      <title>How can I get Synapse Notebook to only process newly added data?</title>
      <link>https://community.fabric.microsoft.com/t5/Data-Engineering/How-can-I-get-Synapse-Notebook-to-only-process-newly-added-data/m-p/3349817#M2104</link>
      <description>&lt;DIV class=""&gt;
&lt;DIV class="" data-post-id="76766996"&gt;
&lt;DIV class="" data-value="0"&gt;&lt;SPAN&gt;&lt;SPAN&gt;I have data that I plan on uploading to an Azure storage account. My plan is to create a pipeline in Synapse Studio, which will include an Apache Notebook (Using PySpark). The primary objective is to have the Notebook process the data and then save it to a lake database.&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/DIV&gt;
&lt;DIV class="" data-value="0"&gt;&amp;nbsp;&lt;/DIV&gt;
&lt;DIV class="" data-value="0"&gt;
&lt;DIV class=""&gt;
&lt;DIV class=""&gt;
&lt;P&gt;The data will be uploaded to an Azure storage container following this layout, for example: &lt;EM&gt;2022/Week1/Week1.xlsx&lt;/EM&gt; and &lt;EM&gt;2023/Week10/Week10.xlsx&lt;/EM&gt;. Initially, I will store and process all historical data in the storage account. After that, data will be processed and added to the lake database on a weekly basis. The question is: what is the most efficient way for the Azure pipeline or the notebook to identify and process only the newly added data?&lt;/P&gt;
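One common approach (a sketch for illustration, not from the thread): since the folder layout encodes the year and week, the notebook can derive the current week's file path from the run date and read only that file on each weekly run. The storage account and container names below are placeholders, and the helper function name is hypothetical.

```python
from datetime import date

def weekly_blob_path(run_date, account="ACCOUNT", container="CONTAINER"):
    # Build the ADLS Gen2 path for the week containing run_date,
    # following the "year/WeekN/WeekN.xlsx" layout described above.
    # isocalendar() returns the ISO (year, week, weekday) triple.
    year, week, _ = run_date.isocalendar()
    return (f"abfss://{container}@{account}.dfs.core.windows.net/"
            f"{year}/Week{week}/Week{week}.xlsx")
```

A pipeline trigger could pass the run date in as a parameter, so backfilling a missed week only requires re-running with that week's date.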
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;</description>
      <pubDate>Wed, 26 Jul 2023 00:37:13 GMT</pubDate>
      <guid>https://community.fabric.microsoft.com/t5/Data-Engineering/How-can-I-get-Synapse-Notebook-to-only-process-newly-added-data/m-p/3349817#M2104</guid>
      <dc:creator>HamidBee</dc:creator>
      <dc:date>2023-07-26T00:37:13Z</dc:date>
    </item>
    <item>
      <title>Re: How can I get Synapse Notebook to only process newly added data?</title>
      <link>https://community.fabric.microsoft.com/t5/Data-Engineering/How-can-I-get-Synapse-Notebook-to-only-process-newly-added-data/m-p/3356893#M2105</link>
      <description>&lt;P&gt;Hello&amp;nbsp;&lt;a href="https://community.fabric.microsoft.com/t5/user/viewprofilepage/user-id/344683"&gt;@HamidBee&lt;/a&gt;&amp;nbsp;&lt;BR /&gt;Thanks for using the Fabric community. &lt;BR /&gt;I believe you will have to use the below function in the notebook to get the weeknumber dynamically every week .&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;A href="https://spark.apache.org/docs/2.3.0/api/python/pyspark.sql.html" target="_self"&gt;&lt;EM&gt;weekofyear(df.colname)&lt;/EM&gt;&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;BR /&gt;Thanks&lt;BR /&gt;Himanshu&lt;/P&gt;</description>
      <pubDate>Sun, 30 Jul 2023 16:32:54 GMT</pubDate>
      <guid>https://community.fabric.microsoft.com/t5/Data-Engineering/How-can-I-get-Synapse-Notebook-to-only-process-newly-added-data/m-p/3356893#M2105</guid>
      <dc:creator>HimanshuS-msft</dc:creator>
      <dc:date>2023-07-30T16:32:54Z</dc:date>
    </item>
  </channel>
</rss>

