<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Pyspark code running twice causing a LIVY status = DEAD in Data Engineering</title>
    <link>https://community.fabric.microsoft.com/t5/Data-Engineering/Pyspark-code-running-twice-causing-a-LIVY-status-DEAD/m-p/4074124#M3415</link>
    <description>&lt;P&gt;Hello, I am running the following cell (all packages have been imported and parameters set in previous cells, and the code runs)&lt;BR /&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;df = spark.sql(f"select * from LH.STOCK_Price where YEAR = {parameter_year} and MONTH &amp;lt;= {parameter_month}")

# Filter the DataFrame to include only rows where the year in TIMESTAMP is parameter
#df = df.filter((col("YEAR") == parameter_year) &amp;amp; (col("MONTH") &amp;lt;= parameter_month))

#df.cache()

# Ensure the data is partitioned appropriately
df = df.repartition("TYPE", "Vendor", col("TIMESTAMP").cast("date"))

# Define an aggregate expression to count distinct IDs
distinct_count = F.expr("count(distinct ID)").alias("ID_DISTINCT_COUNT")

# Group by TYPE, Vendor, and TIMESTAMP (cast to date)
result = df.groupBy(
    col("TYPE"),
    col("Vendor"),
    col("TIMESTAMP").cast("date").alias("DATE")
).agg(
    count(col("ID")).alias("ID_COUNT"),
    distinct_count
)

result.cache()

# Write the result to a Delta table with partitioning
result.write.format("delta").mode("append").partitionBy("DATE").saveAsTable("LH.Count_Date_Wise_STOCKS")&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;The aim of my code is to read the&amp;nbsp;STOCK_Price delta table from a lakehouse for a specific year and month into a dataframe df, then group by several columns and compute the count of ID and its distinct count. I cache the result and then write it to an aggregate delta table (Count_Date_Wise_STOCKS), but part of my code seems to be running twice, as I can see from the Spark diagnostic below the cell:&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="mkj1213_0-1722533244553.png" style="width: 400px;"&gt;&lt;img src="https://community.fabric.microsoft.com/t5/image/serverpage/image-id/1143254i17EC72F344B83AF6/image-size/medium?v=v2&amp;amp;px=400" role="button" title="mkj1213_0-1722533244553.png" alt="mkj1213_0-1722533244553.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;The second run ends up producing a session error with the error code LIVY state = DEAD.&lt;BR /&gt;&lt;BR /&gt;Some statistics about my tables:&lt;BR /&gt;STOCK_Price: around 700 billion rows&lt;BR /&gt;df: about 25 billion rows&lt;BR /&gt;Count_Date_Wise_STOCKS: around 10,000 rows only.&lt;BR /&gt;Executing the cell sometimes takes up to 10 hours for 6 months of data and returns a&amp;nbsp;Count_Date_Wise_STOCKS of around 4,000 rows.&lt;BR /&gt;&lt;BR /&gt;Questions: which part of my code is causing the second run, and how can I avoid it if possible?&lt;BR /&gt;Regards&lt;/P&gt;</description>
    <pubDate>Thu, 01 Aug 2024 17:31:02 GMT</pubDate>
    <dc:creator>mkj1213</dc:creator>
    <dc:date>2024-08-01T17:31:02Z</dc:date>
    <item>
      <title>Pyspark code running twice causing a LIVY status = DEAD</title>
      <link>https://community.fabric.microsoft.com/t5/Data-Engineering/Pyspark-code-running-twice-causing-a-LIVY-status-DEAD/m-p/4074124#M3415</link>
      <description>&lt;P&gt;Hello, I am running the following cell (all packages have been imported and parameters set in previous cells, and the code runs)&lt;BR /&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;df = spark.sql(f"select * from LH.STOCK_Price where YEAR = {parameter_year} and MONTH &amp;lt;= {parameter_month}")

# Filter the DataFrame to include only rows where the year in TIMESTAMP is parameter
#df = df.filter((col("YEAR") == parameter_year) &amp;amp; (col("MONTH") &amp;lt;= parameter_month))

#df.cache()

# Ensure the data is partitioned appropriately
df = df.repartition("TYPE", "Vendor", col("TIMESTAMP").cast("date"))

# Define an aggregate expression to count distinct IDs
distinct_count = F.expr("count(distinct ID)").alias("ID_DISTINCT_COUNT")

# Group by TYPE, Vendor, and TIMESTAMP (cast to date)
result = df.groupBy(
    col("TYPE"),
    col("Vendor"),
    col("TIMESTAMP").cast("date").alias("DATE")
).agg(
    count(col("ID")).alias("ID_COUNT"),
    distinct_count
)

result.cache()

# Write the result to a Delta table with partitioning
result.write.format("delta").mode("append").partitionBy("DATE").saveAsTable("LH.Count_Date_Wise_STOCKS")&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;The aim of my code is to read the&amp;nbsp;STOCK_Price delta table from a lakehouse for a specific year and month into a dataframe df, then group by several columns and compute the count of ID and its distinct count. I cache the result and then write it to an aggregate delta table (Count_Date_Wise_STOCKS), but part of my code seems to be running twice, as I can see from the Spark diagnostic below the cell:&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="mkj1213_0-1722533244553.png" style="width: 400px;"&gt;&lt;img src="https://community.fabric.microsoft.com/t5/image/serverpage/image-id/1143254i17EC72F344B83AF6/image-size/medium?v=v2&amp;amp;px=400" role="button" title="mkj1213_0-1722533244553.png" alt="mkj1213_0-1722533244553.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;The second run ends up producing a session error with the error code LIVY state = DEAD.&lt;BR /&gt;&lt;BR /&gt;Some statistics about my tables:&lt;BR /&gt;STOCK_Price: around 700 billion rows&lt;BR /&gt;df: about 25 billion rows&lt;BR /&gt;Count_Date_Wise_STOCKS: around 10,000 rows only.&lt;BR /&gt;Executing the cell sometimes takes up to 10 hours for 6 months of data and returns a&amp;nbsp;Count_Date_Wise_STOCKS of around 4,000 rows.&lt;BR /&gt;&lt;BR /&gt;Questions: which part of my code is causing the second run, and how can I avoid it if possible?&lt;BR /&gt;Regards&lt;/P&gt;</description>
      <pubDate>Thu, 01 Aug 2024 17:31:02 GMT</pubDate>
      <guid>https://community.fabric.microsoft.com/t5/Data-Engineering/Pyspark-code-running-twice-causing-a-LIVY-status-DEAD/m-p/4074124#M3415</guid>
      <dc:creator>mkj1213</dc:creator>
      <dc:date>2024-08-01T17:31:02Z</dc:date>
    </item>
    <item>
      <title>Re: Pyspark code running twice causing a LIVY status = DEAD</title>
      <link>https://community.fabric.microsoft.com/t5/Data-Engineering/Pyspark-code-running-twice-causing-a-LIVY-status-DEAD/m-p/4075179#M3421</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.fabric.microsoft.com/t5/user/viewprofilepage/user-id/777184"&gt;@mkj1213&lt;/a&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;The number of Spark jobs doesn't mean that the code ran that many times. I ran some tests based on your code (with some modifications to fit my data). In the following images, you will see that it produced 7 Spark jobs.&lt;/P&gt;
&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="vjingzhanmsft_1-1722576963149.png" style="width: 400px;"&gt;&lt;img src="https://community.fabric.microsoft.com/t5/image/serverpage/image-id/1143581iD9ACD3076DF32847/image-size/medium?v=v2&amp;amp;px=400" role="button" title="vjingzhanmsft_1-1722576963149.png" alt="vjingzhanmsft_1-1722576963149.png" /&gt;&lt;/span&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="vjingzhanmsft_2-1722576989183.png" style="width: 400px;"&gt;&lt;img src="https://community.fabric.microsoft.com/t5/image/serverpage/image-id/1143582iDB6E8DB5758DA3A7/image-size/medium?v=v2&amp;amp;px=400" role="button" title="vjingzhanmsft_2-1722576989183.png" alt="vjingzhanmsft_2-1722576989183.png" /&gt;&lt;/span&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;When I query the result delta table, it doesn't show any duplicated rows. This means the code ran only once, even though it shows 7 Spark jobs.&lt;/P&gt;
&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="vjingzhanmsft_3-1722577234596.png" style="width: 400px;"&gt;&lt;img src="https://community.fabric.microsoft.com/t5/image/serverpage/image-id/1143596i402D1D2E84CE6F91/image-size/medium?v=v2&amp;amp;px=400" role="button" title="vjingzhanmsft_3-1722577234596.png" alt="vjingzhanmsft_3-1722577234596.png" /&gt;&lt;/span&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;According to my research, this "&lt;SPAN&gt;LIVY status = DEAD&lt;/SPAN&gt;"&amp;nbsp;error is usually the result of a lack of resources or of a resource going over its limit. I found an Azure Synapse Analytics blog post related to this error, linked below. Following the solution in it, you can try to &lt;STRONG&gt;increase the node size of the Spark pool &lt;/STRONG&gt;used to run the notebook.&lt;/P&gt;
&lt;P&gt;&lt;A href="https://techcommunity.microsoft.com/t5/azure-synapse-analytics-blog/livy-is-dead-and-some-logs-to-help/ba-p/1573227" target="_blank" rel="noopener"&gt;Livy is dead and some logs to help. - Microsoft Community Hub&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Here is some Fabric documentation about configuring Spark pools:&lt;/P&gt;
&lt;P&gt;&lt;A href="https://learn.microsoft.com/en-us/fabric/data-engineering/workspace-admin-settings" target="_blank" rel="noopener"&gt;Workspace administration settings in Microsoft Fabric - Microsoft Fabric | Microsoft Learn&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;&lt;A href="https://learn.microsoft.com/en-us/fabric/data-engineering/capacity-settings-management" target="_blank" rel="noopener"&gt;Manage settings for data engineering and science capacity - Microsoft Fabric | Microsoft Learn&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="vjingzhanmsft_4-1722578246456.png" style="width: 400px;"&gt;&lt;img src="https://community.fabric.microsoft.com/t5/image/serverpage/image-id/1143609i0A929559708FC9DB/image-size/medium?v=v2&amp;amp;px=400" role="button" title="vjingzhanmsft_4-1722578246456.png" alt="vjingzhanmsft_4-1722578246456.png" /&gt;&lt;/span&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Hope this will be helpful!&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Best Regards,&lt;BR /&gt;Jing&lt;BR /&gt;If this post helps, please Accept it as Solution to help other members find it. Appreciate your Kudos!&lt;/P&gt;</description>
      <pubDate>Fri, 02 Aug 2024 06:12:40 GMT</pubDate>
      <guid>https://community.fabric.microsoft.com/t5/Data-Engineering/Pyspark-code-running-twice-causing-a-LIVY-status-DEAD/m-p/4075179#M3421</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2024-08-02T06:12:40Z</dc:date>
    </item>
    <item>
      <title>Re: Pyspark code running twice causing a LIVY status = DEAD</title>
      <link>https://community.fabric.microsoft.com/t5/Data-Engineering/Pyspark-code-running-twice-causing-a-LIVY-status-DEAD/m-p/4075243#M3422</link>
      <description>&lt;P&gt;Thanks for giving my problem some time on your end.&lt;BR /&gt;I will look into the links you provided in more detail and get back to you. The only difference I noticed between the structure of your code and mine was the agg part: in my code I did two aggregations (count and distinct count), while in yours you did only one. Is it possible to add another aggregation method (maybe a count) and share the Spark jobs screenshot (like the second image you included in your comment)?&lt;BR /&gt;&lt;BR /&gt;On the Spark pool settings, it turned out that I was using the Large size (maybe I could increase it to X-Large or XX-Large):&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="mkj1213_0-1722584013232.png" style="width: 400px;"&gt;&lt;img src="https://community.fabric.microsoft.com/t5/image/serverpage/image-id/1143688iA31D306045F7BE6B/image-size/medium?v=v2&amp;amp;px=400" role="button" title="mkj1213_0-1722584013232.png" alt="mkj1213_0-1722584013232.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;Regards&lt;/P&gt;</description>
      <pubDate>Fri, 02 Aug 2024 07:33:41 GMT</pubDate>
      <guid>https://community.fabric.microsoft.com/t5/Data-Engineering/Pyspark-code-running-twice-causing-a-LIVY-status-DEAD/m-p/4075243#M3422</guid>
      <dc:creator>mkj1213</dc:creator>
      <dc:date>2024-08-02T07:33:41Z</dc:date>
    </item>
    <item>
      <title>Re: Pyspark code running twice causing a LIVY status = DEAD</title>
      <link>https://community.fabric.microsoft.com/t5/Data-Engineering/Pyspark-code-running-twice-causing-a-LIVY-status-DEAD/m-p/4088637#M3488</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.fabric.microsoft.com/t5/user/viewprofilepage/user-id/777184"&gt;@mkj1213&lt;/a&gt;&amp;nbsp;This is my testing result with two&amp;nbsp;&lt;SPAN&gt;aggregation methods. It produced 6 Spark jobs the first time.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="vjingzhanmsft_0-1723102839948.png" style="width: 400px;"&gt;&lt;img src="https://community.fabric.microsoft.com/t5/image/serverpage/image-id/1147092iBD6644A3951673D2/image-size/medium?v=v2&amp;amp;px=400" role="button" title="vjingzhanmsft_0-1723102839948.png" alt="vjingzhanmsft_0-1723102839948.png" /&gt;&lt;/span&gt;&lt;/P&gt;
&lt;P&gt;The second time:&lt;/P&gt;
&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="vjingzhanmsft_1-1723103134276.png" style="width: 400px;"&gt;&lt;img src="https://community.fabric.microsoft.com/t5/image/serverpage/image-id/1147095i5E401805E8BF8DF8/image-size/medium?v=v2&amp;amp;px=400" role="button" title="vjingzhanmsft_1-1723103134276.png" alt="vjingzhanmsft_1-1723103134276.png" /&gt;&lt;/span&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 08 Aug 2024 07:46:07 GMT</pubDate>
      <guid>https://community.fabric.microsoft.com/t5/Data-Engineering/Pyspark-code-running-twice-causing-a-LIVY-status-DEAD/m-p/4088637#M3488</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2024-08-08T07:46:07Z</dc:date>
    </item>
    <item>
      <title>Re: Pyspark code running twice causing a LIVY status = DEAD</title>
      <link>https://community.fabric.microsoft.com/t5/Data-Engineering/Pyspark-code-running-twice-causing-a-LIVY-status-DEAD/m-p/4091069#M3506</link>
      <description>&lt;P&gt;Thanks for the reply. I am wondering: how large is the data in the bing_covid_19_data table?&lt;/P&gt;</description>
      <pubDate>Fri, 09 Aug 2024 07:45:46 GMT</pubDate>
      <guid>https://community.fabric.microsoft.com/t5/Data-Engineering/Pyspark-code-running-twice-causing-a-LIVY-status-DEAD/m-p/4091069#M3506</guid>
      <dc:creator>mkj1213</dc:creator>
      <dc:date>2024-08-09T07:45:46Z</dc:date>
    </item>
    <item>
      <title>Re: Pyspark code running twice causing a LIVY status = DEAD</title>
      <link>https://community.fabric.microsoft.com/t5/Data-Engineering/Pyspark-code-running-twice-causing-a-LIVY-status-DEAD/m-p/4097136#M3578</link>
      <description>&lt;P&gt;This is small sample data. The underlying parquet file is around 51 MB and the table has around 4.7 million rows.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;If you run the code cell with less data from the same tables, does the error still occur? If not, you might consider splitting the data and running the aggregation separately for different time periods.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;In addition, did you try increasing the node size or the number of nodes? Does that give a better result?&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;What's more, you may try using&amp;nbsp;&lt;A href="https://learn.microsoft.com/en-us/fabric/data-engineering/spark-monitoring-overview" target="_blank" rel="noopener"&gt;Fabric Spark monitoring&lt;/A&gt; to find more details about the notebook's Spark jobs.&lt;/P&gt;
&lt;P&gt;&lt;A href="https://learn.microsoft.com/en-us/fabric/data-engineering/spark-item-recent-runs#all-runs-within-a-notebook" target="_blank" rel="noopener"&gt;Browse item's recent runs - Microsoft Fabric | Microsoft Learn&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;&lt;A href="https://learn.microsoft.com/en-us/fabric/data-engineering/spark-monitor-debug" target="_blank" rel="noopener"&gt;Monitor Spark jobs within a notebook - Microsoft Fabric | Microsoft Learn&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;Best Regards,&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN&gt;Jing&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 13 Aug 2024 08:22:33 GMT</pubDate>
      <guid>https://community.fabric.microsoft.com/t5/Data-Engineering/Pyspark-code-running-twice-causing-a-LIVY-status-DEAD/m-p/4097136#M3578</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2024-08-13T08:22:33Z</dc:date>
    </item>
    <item>
      <title>Re: Pyspark code running twice causing a LIVY status = DEAD</title>
      <link>https://community.fabric.microsoft.com/t5/Data-Engineering/Pyspark-code-running-twice-causing-a-LIVY-status-DEAD/m-p/4110749#M3705</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.fabric.microsoft.com/t5/user/viewprofilepage/user-id/777184"&gt;@mkj1213&lt;/a&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Have you resolved this issue? If any of the answers provided were helpful, please consider accepting them as a solution. If you have found other solutions, we would greatly appreciate it if you could share them with us. Thank you!&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Best Regards,&lt;BR /&gt;Jing&lt;/P&gt;</description>
      <pubDate>Wed, 21 Aug 2024 08:38:12 GMT</pubDate>
      <guid>https://community.fabric.microsoft.com/t5/Data-Engineering/Pyspark-code-running-twice-causing-a-LIVY-status-DEAD/m-p/4110749#M3705</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2024-08-21T08:38:12Z</dc:date>
    </item>
  </channel>
</rss>